Abstract

Round  trip  engineering  of  software  from  source  code  and  reverse  engineering  of 
software from binary files have both been extensively studied, and the state-of-practice
has documented tools and techniques. Forward engineering of protocols has also
been  extensively  studied  and  there  are  firmly  established  techniques  for  generating 
correct  protocols.  While  observation  of  protocol  behavior  for  performance  testing  has 
been  studied  and  techniques  established,  reverse  engineering  of  protocol  control  flow 
from  observations  of  protocol  behavior  has  not  received  the  same  level  of  attention. 
State-of-practice in reverse engineering the control flow of computer network protocols
consists mostly of ad hoc approaches. We examine state-of-practice tools
and techniques used in three open source projects: Pidgin, Samba, and rdesktop. We
examine techniques proposed by computational learning researchers for grammatical
inference. We propose to extend the state-of-art by inferring protocol control flow
using grammatical inference inspired techniques to reverse engineer automata
representations from captured data flows. We present evidence that grammatical inference
is applicable to the problem domain under consideration.
Dynamic Protocol Reverse Engineering
A  Grammatical  Inference  Approach 

I.  Introduction 

As  the  United  States  Air  Force  (USAF)  extends  into  the  Cyberspace  domain, 
the  ease  of  breaking  into  computer  networks  and  misusing  distributed  systems  has 
become increasingly problematic [163,172,173,242,280]. Deep understanding of the
protocols which traverse computer networks and enable distributed systems is increasingly
important to securing our computer networks and putting opponents' networked
operations at risk.

1.1  Operations  in  Cyberspace 

The  DOD  defines  Cyberspace  as  a  domain  characterized  by  the  use  of  electronics 
and  the  electromagnetic  spectrum  to  store,  modify,  and  exchange  data  via  networked 
systems  and  associated  infrastructures.  Operations  in  Cyberspace  have  both  strategic 
and  tactical  requirements.  Tactics,  Techniques  and  Procedures  coupled  with  weapons 
systems that produce reliable and predictable battle effects are essential. Freedom
of Cyberspace, much like Freedom of the Seas and Freedom of the Skies, has become
essential to our way of life. As such, our current inability to operate in Cyberspace
as  a  domain  of  military  operations,  governed  by  mathematical  and  electromagnetic 
principles,  requires  us  to  develop,  train  and  equip  cyber  forces  that  can  guarantee 
Freedom  of  Cyberspace.  In  the  words  of  Secretary  Wynne  [280]: 

Cyberspace is a domain for projecting and protecting national power, for
both strategic and tactical operations [212].

Vulnerabilities in technical standards and concrete implementations of technical
standards are cyber warriors' fighting positions. Freedom of Cyberspace will require
tactical cyber power: the ability to degrade, disrupt, deny and destroy adversaries'
fighting positions while defending our own. Deep understanding of distributed systems
is a critical enabler to developing cyber power in network centric cyber spaces.

In  this  thesis  we  focus  on  protocol  reverse  engineering  as  a  method  that  enables 
the  generation  of  instruments  of  tactical  cyber  power  in  digital  computer  networks. 
We  do  not  discuss  the  broader  topics  of  computer  network  exploitation/protection  or 
electronic  warfare.  In  fact,  we  view  protocol  reverse  engineering  as  only  one  facet  of 
the  larger  topic  of  tactical  cyber  power.  Also,  we  will  concentrate  on  technical  means 
that  enable  creation  of  tactical  cyber  weapons  over  doctrine,  organization,  and  policy. 

At  the  outset  of  performing  the  preliminary  literature  review  it  was  apparent 
that  the  volume  of  academic  information  concerning  protocol  forward  engineering 
greatly  exceeded  the  volume  of  academic  information  on  protocol  reverse  engineering. 
It  is  our  contention  that  the  state-of-the-art  in  protocol  reverse  engineering  methods 
and  tools  remains  largely  shrouded  from  the  view  of  the  general  public. 

Due in part to the underground nature of the subject, applying protocol reverse
engineering to generate effective instruments of tactical cyber power is a
challenging problem.

Generation of tactical cyber weapons requires a deep understanding of the technical
architecture of the systems under consideration. A cyber weapon must provide
effective, reliable, and repeatable battlespace effects.

A  first  step  is  to  recover  models  of  protocols  that  increase  analyst  understanding 
and  support  formal  analysis  to  verify  the  effects  of  network  centric  cyber  weapons. 

1.2  Problem  Domain 

Correct  protocol  design  is  a  difficult  engineering  task.  Gerard  Holzmann  offers 
the  following: 

It is the unexpected sequences of events that lead to protocol failures, and
the hardest problem in protocol design is precisely that we must try to
expect the unexpected [109].
Protocol model recovery from network traffic is a challenging problem.

While the algorithm domain under consideration is provably NP-complete (NPC), we
do not provide proof that the problem domain is NPC. We only conjecture
that the problem domain is NPC.

Given  the  complexity  of  correctly  designing  a  protocol  specification  and  then 
accurately  engineering  a  protocol  implementation  it  is  not  surprising  that  protocols 
exhibit  vulnerabilities.  Causes  of  unexpected  conditions  that  expose  vulnerabilities 
range from accidental oversight [220] to deliberate attack [12, Section 7.2].

The problem domain under consideration is design recovery of protocol models
from captured data flows. Ultimately, the recovered designs should support formal
analysis that identifies implementation issues that allow deliberate attack or accidental
failure. In essence, can we discover protocol implementation issues that allow
deliberately crafted packets to lead a protocol parser to unexpected conditions?

1.3  Related  Problem  Domains 

Network traffic classification and deep packet inspection are related to protocol
reverse engineering. Both domains require understanding of protocols that might
not be documented in open specifications [76,126,127,185,193,200,264]. Likewise,
signature based intrusion detection requires deep knowledge of protocols' inner workings
[178]. We conjecture that behavioral based intrusion detection could also benefit
from models recovered via protocol reverse engineering. Finally, protocol reverse
engineering can draw practical methods from the domain of protocol conformance testing.

While we concentrated on a subset of application level protocols on IPv4 networks,
similar experimental analysis could be conducted against other classes of protocols,
such as SCADA or SS7, for vulnerability assessment and potentially generation
of targeted effects.

Supervisory  Control  and  Data  Acquisition  -  [47,79,134]  introduce  the  subject.  [47,  Section  6.7, 
Chap  12]  covers  TCP/IP  encapsulated  SCADA  communications. 

Signaling  System  7  -  a  set  of  telephony  signaling  protocols 
1.4 Related Application Domains

Automated specification generation, automated test generation and automated
conformance testing provide architectures that are useful for protocol reverse
engineering. Dssouli et al. present a test automation architecture for distributed systems
in [68]. Ammons examines automated specification generation from program execution
traces [6]. Automated test generation for white-box and black-box testing is well
studied for software testing. Random boundary testing methods for network protocols
are covered by [249, Chapter 14] and for software testing in [175,188].

Tretmans covers OSI protocol conformance testing in [257] while Berg examines
the similarities between regular inference and conformance testing in “On the
Correspondence Between Conformance Testing and Regular Inference” [22]. Conformance
testing as an Angluin query styled learning problem is also examined by Lai in [136],
which presents a genetic algorithm approach to adaptive model checking.

1.5  Investigative  Questions 

The  focus  of  this  research  is  the  evaluation  of  existing  Grammatical  Inference 
(GI)  algorithms  for  the  dynamic  protocol  reverse  engineering  domain.  GI  is  a  branch 
of  artificial  intelligence  that  concentrates  on  the  inference  of  grammatical  structures 
from  samples  of  a  language.  In  particular,  we  ask  the  following  questions: 

IQ1 What information is necessary to reverse engineer the control portion of application
layer protocols from data flows?

IQ2 Given the proven [7, 95] difficulty of inferring finite automata from positive samples
only, are there GI approaches that are appropriate for reverse engineering
automata representations of the control portion of application layer protocols
from data flows?

We  propose  to  apply  GI  algorithms  to  recover  structure  from  the  protocol  stream 
that  is  not  immediately  obvious  from  observation  of  individual  packets.  We  hope  to 
advance  the  state-of-art  in  protocol  reverse  engineering  by  automatically  revealing 
structural relationships much like an oscilloscope displays waveforms from an electric
signal.  We  will  concentrate  on  a  posteriori  analysis  instead  of  online  analysis  of  live 
execution  traces. 
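
To make the idea of recovering structure from positive samples more concrete, the following is a minimal sketch of a prefix tree acceptor (PTA), a common starting point for state-merging grammatical inference. It is offered only as an illustration and is not claimed to be one of the algorithms evaluated later; the function names and the POP3-like command sequences are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Transitions = Dict[Tuple[int, str], int]


def build_pta(samples: Iterable[Sequence[str]]) -> Tuple[Transitions, Set[int]]:
    """Build a prefix tree acceptor from positive samples only.

    State 0 is the start state. Returns (transitions, accepting_states).
    """
    transitions: Transitions = {}
    accepting: Set[int] = set()
    next_state = 1
    for sample in samples:
        state = 0
        for symbol in sample:
            key = (state, symbol)
            if key not in transitions:
                # First time this prefix is seen: allocate a fresh state.
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)  # the end of every sample is an accepting state
    return transitions, accepting


def accepts(transitions: Transitions, accepting: Set[int],
            sequence: Sequence[str]) -> bool:
    """Return True if the acceptor recognizes the given operator sequence."""
    state = 0
    for symbol in sequence:
        key = (state, symbol)
        if key not in transitions:
            return False
        state = transitions[key]
    return state in accepting


# Hypothetical operator sequences, as might be abstracted from two sessions.
SAMPLES: List[List[str]] = [
    ["USER", "PASS", "STAT", "QUIT"],
    ["USER", "PASS", "LIST", "QUIT"],
]
```

A PTA accepts exactly the observed samples; generalizing beyond them requires merging states, which is where the difficulty of learning from positive samples only becomes apparent.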

While  we  discuss  computer  science  oriented  theoretical  aspects  of  GI  we  are 
more  interested  in  the  engineering  oriented  pragmatic  recovery  of  structure  and  the 
presentation  of  experimental  evidence  that  establishes  the  applicability  of  GI  to  the 
problem  of  protocol  model  recovery. 

Finally,  we  recognize  that  the  approach  presented  is  only  a  partial  solution  to 
the  problem  domain  under  consideration.  Human  analysts  must  continue  to  apply 
common  heuristics  (e.g.  identifying  signpost  values,  block  structure  inference,  or 
windowed  entropy). 
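
As an illustration of one of the analyst heuristics named above, the sketch below computes Shannon entropy over a sliding window of a payload. The window and step sizes are arbitrary assumptions, and the function names are hypothetical; it is a sketch of the general technique rather than a description of any particular tool.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def windowed_entropy(payload: bytes, window: int = 16, step: int = 8) -> List[float]:
    """Entropy of each window of `window` bytes, advancing by `step` bytes."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if len(payload) <= window:
        return [shannon_entropy(payload)]
    return [
        shannon_entropy(payload[i:i + window])
        for i in range(0, len(payload) - window + 1, step)
    ]
```

Runs of low values tend to indicate padding or repeated fields, while runs of high values can indicate compressed or encrypted regions; the thresholds an analyst uses are a judgment call.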


1.6  Document  Overview 

In Chapter II we present the problem domain under consideration. Next, in
Chapter III we introduce the Chomsky Hierarchy as a framework for discussing
computational learnability and overview several existing grammatical inference
algorithms. In Chapter IV we describe the experimental architecture used to evaluate
two selected grammatical inference algorithms against POP3 and SMTP traffic from
the IDEVAL data set. Finally, in Chapter V we provide the results of our evaluation
and propose areas that can be refined in future work.
II. Problem Domain


This chapter provides background regarding the problem domain. First, we describe
the problem domain under consideration with an English description. We discuss
distributed systems and issues related to protocol design recovery. Finally, we
introduce protocol reverse engineering and overview state-of-practice and state-of-art
reverse engineering tools and techniques.

2.1  Distributed  Systems 

A distributed system in the most abstract sense consists of three elements:
nodes, the processes executing on servers, desktops, or sensors; links, the cable plant,
air, or other physical transmission medium; and protocols. A protocol is a kind of
agreement about the exchange of messages in a distributed system [49,109,250]. A
complete protocol definition is very similar to a language description: it defines a
strict syntactical format for valid messages; it defines data exchange procedure rules;
and it defines semantics, a vocabulary of valid messages and their meaning [109]. The
protocol's grammar must be logically consistent and complete. The procedure rules
should explicitly specify what is permitted or forbidden [109]. Finally, the sender and
receiver must implement compatible rules for communication to succeed [109].

A distributed system can be abstracted by dividing it into application stacks,
the components that make up the nodes; and protocol stacks, the layered architecture
that implements the rules of communication. The application stacks are the operating
systems and application software on any given node. To avoid combinatorial
state explosion, protocols for distributed systems are often designed, developed, and
implemented in layered architectures [49,150,252]. Each layer of the communications
architecture implements services that are presented to the more abstract layers above
it.

Figure 2.1 shows the Open Systems Interconnection (OSI) reference model proposed
by the International Organization for Standardization (ISO) in 1982 [236].
Figure 2.1: OSI Reference Model [236].


The  protocol  stack  is  the  composite  of  the  layers  that  are  utilized  by  a  distributed 
system.  A  protocol  stack  implements  a  protocol  reference  model.  The  Transmission 
Control  Protocol  (TCP)  /  Internet  Protocol  (IP)  protocol  family  concentrates  on  the 
transport  and  network  layers  of  the  OSI  reference  model  [49,243].  The  protocol  stack 
that  supports  a  distributed  system  is  completed  by  adding  a  data  link  and  physical 
link  implementation,  such  as  Ethernet  over  fiber  optic  cable.  Vulnerabilities  in  a 
protocol  stack  can  be  leveraged  as  a  propagation  vector  for  attacks  on  an  application 
stack [12, Section 7.2]. Methods to exploit known vulnerabilities are readily available
in pre-packaged frameworks such as Cain & Abel [176] and Metasploit [83].

To limit the scope of our research, we have chosen to focus on application layer
protocols and concentrate on the protocol stack over the application stack. Specifically,
we will concentrate on application layer protocols that use the TCP/IP protocol
family version 4 (IPv4), which defines much of the modern Internet [49,243]. We consider
TCP/IP exchanges of TCP packets that encapsulate application layer protocols.

1 Cain & Abel - http://www.oxid.it/cain.html
2 Metasploit - http://www.metasploit.com
Figure 2.2: Internet from TCP Perspective [78].

Constraining the OSI reference model to just TCP connections leads to Figure 2.2.
TCP,  the  dominant  transport  protocol  on  the  Internet  [119],  uses  the  communications 
system  at  the  layers  below  it  as  a  simple  black  box  and  does  not  concern  itself  with 
the  layers  in  the  model  above  it. 

Likewise, distributed systems that implement application level protocols use
TCP and the lower levels as a black box.

A Finite State Machine (FSM) is used to model any device that reacts to its
environment and changes its state according to the inputs. The FSM model is often
extended to include outputs, as a Mealy machine. The source of a well formed TCP
packet is a TCP/IP implementation (Host A) that uses an automaton to keep track of the
state of a particular connection to a destination TCP automaton (Host B) [33,49,78].
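
The extension of an FSM with outputs can be sketched as follows. This is a minimal illustration of a Mealy machine in which each (state, input) pair maps to a (next state, output) pair. The class name, the state names, and the three transitions are hypothetical and heavily simplified; they are not the full TCP state machine.

```python
from typing import Dict, Iterable, List, Tuple

Transition = Dict[Tuple[str, str], Tuple[str, str]]


class MealyMachine:
    """Finite state machine whose output depends on state and input."""

    def __init__(self, start: str, transitions: Transition) -> None:
        self.start = start
        self.transitions = dict(transitions)

    def run(self, inputs: Iterable[str]) -> Tuple[str, List[str]]:
        """Consume the inputs; return the final state and emitted outputs."""
        state = self.start
        outputs: List[str] = []
        for symbol in inputs:
            key = (state, symbol)
            if key not in self.transitions:
                raise ValueError(f"no transition from {state!r} on {symbol!r}")
            state, output = self.transitions[key]
            outputs.append(output)
        return state, outputs


# Hypothetical, simplified server-side view of connection setup and close.
TOY_CONNECTION = MealyMachine(
    start="LISTEN",
    transitions={
        ("LISTEN", "SYN"): ("SYN_RCVD", "SYN+ACK"),
        ("SYN_RCVD", "ACK"): ("ESTABLISHED", ""),
        ("ESTABLISHED", "FIN"): ("CLOSE_WAIT", "ACK"),
    },
)
```

For example, `TOY_CONNECTION.run(["SYN", "ACK"])` ends in the `ESTABLISHED` state after emitting one `SYN+ACK` and one empty output.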

A client application communicates with a server application through a TCP
client that connects to a TCP server through the network [33]. While a TCP connection
is identified by a Source Address/Port and Destination Address/Port pair, the
temporal relationship is actually determined by the state of the TCP Server/TCP
Client pair at the source and the TCP Server/TCP Client pair at the destination. The
distributed system's client application and server application also maintain separate
state automata which transition states depending on the operators received. As shown
in Figure 2.3, when a distributed system uses connection oriented TCP as a transport
we are really dealing with at least four automata in each direction.


Figure 2.3: TCP Client/Server Topology [33].
Figure 2.4: IP Packet Structure - 32-bit wide IPv4 IP packet
structure [93].


Because we are considering application level protocols transported over IPv4
connections, this naturally gives rise to data structures embodied in protocol automata,
formats of protocol operations, and data relationships from state transitions.


2.1.1 Data Structures. The three primary data structures in our selected
problem domain are the IP packet structure, shown in Figure 2.4; the User Datagram
Protocol (UDP) packet structure, shown in Figure 2.5; and the TCP packet structure,
shown in Figure 2.6. Figure 2.4, Figure 2.5, and Figure 2.6 are laid out so they are
32-bits wide.


The UDP and TCP packets are encapsulated into IP packets at the network
transport layer so their source and destination IP addresses are derived from the 32-bit
IP Source Address and 32-bit IP Destination Address. The 16-bit wide source and
destination ports are included in the packet structure (UDP or TCP) that makes up
the IP payload [49].
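
The encapsulation described above can be sketched in code. The example below is a minimal sketch, assuming a raw IPv4 packet is already available as bytes (for instance, from a capture file with the link layer header removed). It reads the addresses from the IP header and the ports from the first four bytes of the TCP or UDP header. The names `PacketInfo` and `parse_ipv4_packet` are hypothetical.

```python
import socket
import struct
from typing import NamedTuple, Optional

TCP, UDP = 6, 17  # IANA-assigned IP protocol numbers


class PacketInfo(NamedTuple):
    src_ip: str
    dst_ip: str
    protocol: int
    src_port: Optional[int]
    dst_port: Optional[int]


def parse_ipv4_packet(packet: bytes) -> PacketInfo:
    """Extract addresses and, when present, transport ports from an IPv4 packet."""
    if len(packet) < 20:
        raise ValueError("shorter than a minimal IPv4 header")
    if packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    header_len = (packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    if header_len < 20 or len(packet) < header_len:
        raise ValueError("invalid IPv4 header length")
    (flags_frag,) = struct.unpack("!H", packet[6:8])
    fragment_offset = flags_frag & 0x1FFF
    protocol = packet[9]
    src_ip = socket.inet_ntoa(packet[12:16])
    dst_ip = socket.inet_ntoa(packet[16:20])
    src_port = dst_port = None
    payload = packet[header_len:]
    # Ports are only present in the first fragment of a TCP or UDP datagram.
    if protocol in (TCP, UDP) and fragment_offset == 0 and len(payload) >= 4:
        src_port, dst_port = struct.unpack("!HH", payload[:4])
    return PacketInfo(src_ip, dst_ip, protocol, src_port, dst_port)
```

The sketch does not validate checksums or handle IP options beyond skipping them, so it should be read as an illustration of the layout rather than a robust parser.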
Figure 2.5: UDP Packet Structure - 32-bit wide IPv4 UDP
packet structure [93].

The UDP packet structure is rather simple because the protocol does not provide
connection oriented features. Distributed systems that use UDP for transport
can be considered connectionless in the transport layer and must provide their own
mechanism for re-transmission of failed communication [49]. Example uses for UDP
are Domain Name System (DNS) and games such as World of Warcraft.

The TCP packet structure requires more information to support reliable communications
service. The TCP protocol provides for reliability, flow control, multiplexing,
precedence, security, and connection oriented transfers [49, 243].

2.1.2 Data Relationships. Data relationships between packets occur at different
levels of granularity: packet, connection or session. Another relationship is the
temporal ordering of packet arrivals, which can be disturbed by packet fragmentation.
A third is the possible causality of connection and session arrivals. Understanding
these relationships is vital to choosing the parameters for clustering packets from
traces into unique conversations between application level protocol endpoints.

Figure 2.6: TCP Packet Structure - 32-bit wide IPv4 TCP
packet structure [93]. Fields shown: Source Port, Destination Port, Sequence Number,
Acknowledge Number, Data Offset, Reserved, flags (URG, ACK, PSH, RST, SYN, FIN),
Window Size, Checksum, Urgent Pointer, Options, Padding, Data.

2.1.2.1 Granularity. At the transport level TCP maintains a communication
channel by using counters and synchronization flags. Application level
protocols might also exhibit causal session structures. Communications at the application
level also often have a logical session structure that associates packets and
connections. The level of granularity (packet, connection, or session) must be considered:


Packet Granularity - IP, UDP, and Internet Control Message Protocol (ICMP)
[208] are all connectionless IPv4 protocols. That is, communications can be sent
without prior arrangement. IP, ICMP, Address Resolution Protocol (ARP),
Routing Information Protocol (RIP), and Open Shortest Path First (OSPF)
routing protocol can be placed at the network layer (Figure 2.1). UDP, on the
other hand, is a connectionless transport mechanism at the network transport
layer (Figure 2.1).


Connection Granularity - At the transport level (Figure 2.1) the most fundamental
relationship between the packet structures is a socket identified by the IP
Source Address/Port pair and the associated IP Destination Address/Port pair.
At an arbitrary point in time a TCP connection between a server and client
can be identified by the IP Source Address/Port pair and the associated IP
Destination Address/Port pair. Much TCP behavior is driven by timers and
timeouts. TCP inherently embodies greater temporal causality that is encoded
in sequence and acknowledgment numbers in a specific TCP connection. This
level of granularity, especially for TCP connections, has been widely studied
(e.g. [42,100,119]).

Session Granularity - The IPv4 suite provides for sessions using Session Initiation
Protocol (SIP) [221], which is a transport independent application layer control
protocol. SIP supports session setup, maintenance, and teardown with one or
more participants. SIP is used in voice, video, and instant messaging applications.
Another alternative is the International Telecommunication Union Telecommunication
Standardization Sector (ITU-T) X.225 connection-oriented session protocol. While
session specifications are available, application level protocols often use custom
session level management.

2.1.2.2 Packet and Connection Fragmentation. Fragmentation is a
feature of IP to support transport of packets across networks with different Maximum
Transmission Units (MTU). Shannon [240] presents a detailed study of packet level
fragmentation. The majority of fragmented traffic in the study was UDP, while
ICMP, IP security (IPsec) [129], and Virtual Private Network (VPN) tunneled traffic
were also commonly fragmented. Fragmentation occurred in 0.5 percent of the total
traffic observed [240]. Reconstruction of connections from packet traces has the
following issues [260]:

1.  The  IP  datagrams  may  be  fragmented. 

2.  IP  fragments  may  arrive  out  of  order. 

3.  Packets  may  be  missing  in  the  network  trace  because  they  were  dropped  during 
capture. 

4.  Adversaries  might  intentionally  create  non-deterministic  situations  with  tools 
like  fragrouter  [1]. 
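
The first three issues can be illustrated with a deliberately idealized sketch. It assumes the fragments of a single datagram have already been grouped and that their offsets have been converted from 8-byte units to bytes. It reorders fragments, reports a gap as failure, and resolves overlaps by keeping the bytes seen first, which is only one of several possible policies. The names are hypothetical, and adversarial inputs of the kind produced by fragrouter are exactly where such a simple policy can diverge from what an end host would reconstruct.

```python
from typing import Iterable, Optional, Tuple

# (byte offset, more-fragments flag, payload) for one IP datagram
Fragment = Tuple[int, bool, bytes]


def reassemble(fragments: Iterable[Fragment]) -> Optional[bytes]:
    """Return the reassembled payload, or None if fragments are missing."""
    ordered = sorted(fragments, key=lambda frag: frag[0])
    if not ordered:
        return None
    if ordered[-1][1]:
        # The highest-offset fragment still announces more fragments,
        # so the tail of the datagram was never captured.
        return None
    data = bytearray()
    expected = 0
    for offset, _more, payload in ordered:
        if offset > expected:
            return None  # gap: a fragment was dropped or not captured
        # On overlap, keep the bytes already accepted (assumed policy).
        new_bytes = payload[expected - offset:]
        data.extend(new_bytes)
        expected += len(new_bytes)
    return bytes(data)
```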

3 ITU-T Recommendations are available at http://www.itu.int/rec/T-REC/en.
Connection oriented TCP also suffers from fragmentation effects when TCP
segments  are  interleaved  because  the  encapsulating  IP  fragments  arrive  out  of  order 
[20]. 

There is no direct algorithmic means for reconstructing a connection's content
from a trace of network packets. Correcting IP packet fragmentation and reassembling
the TCP connection stream requires a significant portion of the IPv4 protocol stack [260].
Turner  [260]  proposed  the  following  possibilities:  pre-process  data  with  libnids  which 
partially  implements  a  Linux  2.0  TCP/IP  protocol  stack  in  user-space  [275];  use  packet 
reassembly  code  from  Wireshark  [281];  port  a  TCP/IP  stack  from  open  sources;  or 
write  a  custom  protocol  stack  from  scratch. 

2.1.2.3 Data Representation. Application layer protocols can be classified
into binary and human readable ASCII text protocols. Text protocols restrict
their operators and payload to printable ASCII text characters. Binary protocol operators
are comprised of data fields that can be mapped to standard data types such
as integers or strings. For example, SMTP uses an ASCII text representation for the
operators while protocols like RPC and SMB/CIFS use binary representations. This
means that Σ, the set of protocol operators, can be structured text or structured
binary data.
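
A rough way to separate the two classes in practice is to measure the fraction of printable ASCII bytes in a payload. The sketch below is a heuristic under stated assumptions: the 0.95 threshold and the function name are hypothetical, and a text protocol carrying binary attachments would need a more careful treatment.

```python
import string

# Byte values of printable ASCII characters, including common whitespace.
PRINTABLE_BYTES = frozenset(string.printable.encode("ascii"))


def looks_like_text(payload: bytes, threshold: float = 0.95) -> bool:
    """Guess whether a payload belongs to a human readable text protocol."""
    if not payload:
        return False
    printable = sum(1 for byte in payload if byte in PRINTABLE_BYTES)
    return printable / len(payload) >= threshold
```

For instance, an SMTP command line such as `HELO beta.banana.edu` followed by CRLF is classified as text, while a payload dominated by zero bytes and length fields is not.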

2.1.3  Other  Data  Characteristics.  User  behavior  also  exhibits  higher  order 
structure.  The  problem  is  examined  by  Kannan  [125]  who  defines  a  session  as  a  group 
of  network  connections  related  to  a  network  task.  A  network  task  is  activity  that 
emanates from an external event (the causal origin) [125]. We do not examine higher
order  session  structures  such  as  aggregate  user  session  structures  for  web  traffic. 

Finally,  protocols  which  use  encryption  (e.g.  SSL,  SSH,  RDP)  or  are  tunneled 
through  encryption  mechanisms  offer  additional  challenges.  Wright  [277]  presents  an 
exploratory  look  at  identification  of  packets  in  encrypted  tunnels.  Gebski  [91]  also 
conducts  an  experimental  evaluation  of  a  model  to  detect  and  identify  TCP  protocols 
[231],


in  encrypted  tunnels.  We  will  not  examine  design  recovery  from  encrypted  or  tunneled 
traces. 


2.2  Network  Trace  Collection 


Placement of the probe points for trace data collection must be considered;
we should rationalize the observation points used. International Standard 9646
(IS-9646) defines four test architectures for OSI protocol conformance testing [68]. The
four types are local, distributed, coordinated and remote test architectures [257].
We can use the ISO test architecture descriptions to classify where a protocol reverse
engineering effort collects trace data. The level of analysis can alternatively be classified
according to the fidelity of observation described by Bhargavan in [24], which is
partially determined by the placement of a monitor for a device under test.


Saleh also discusses placement of data collection points, shown in Figure 2.7
as the Upper Service Access Point (USAP) and Lower Service Access Point (LSAP)
[230]. Like Saleh [230], we will refer to trace collections above the protocol under
observation as Upper Service Access Points and trace collections below the protocol
under observation as Lower Service Access Points. Saleh considers data collected at
the LSAP to provide protocol primitives and data collected at a USAP to provide
service primitives to layers above.
Figure 2.8: Bro deployment with network tap - the network
tap duplicates the physical layer signaling for in-line full-duplex
traffic analysis by the Bro Intrusion Detection System [141].

In a TCP/IP network, port mirroring and hubbing out are two techniques for
trace collection that do not require instrumenting the protocol under investigation.
Port mirroring, or port spanning, is supported by some Layer-3 switching devices.
The switching device copies all traffic on a user specified port to another physical
port on the switch [233]. Hubbing out is a technique in which a target device and
analyzer are located on the same Ethernet network segment by plugging them into
the same hub [233].

Packets dropped from the sample data by the capture mechanism will be an
issue for high bandwidth traffic. Solutions include using custom high-speed collection
hardware, such as [200], implementing hardware to duplicate network traffic at the
physical layer, or limiting the study to low-speed protocols that can be captured with
a high degree of confidence. Bhargavan used a modified Linux system to sniff TCP
network traffic by hubbing out on a 100 Mbps connection, describing this configuration
as a co-networked monitor [24]. The implementers of the Bro intrusion detection
system recommend the use of a physical layer network tap, shown in Figure 2.8, that
exactly duplicates the physical layer signaling for in-line full-duplex traffic analysis
[141].
2.3 Application Level Protocol Data Flow Recovery

We differentiate data flows from the usage of network flows that is commonly
used in graph theory [65]. This is because we are not interested in modeling the network
as a graph but instead modeling the actual flow of data, thus data flow, which
is captured in a trace file. Others have characterized packet data flows as: packet
trains [118]; streams and torrents [35]; or flights [239]. As discussed in Section 2.1.2.2,
we can define a data flow as a quintuple (source IP, source port, destination IP, destination
port, protocol) [264]. An advantage of using tuples is that they support formal
definitions of the operators that apply. Also, defining the mathematical symbology
allows us to discuss the problem domain in a more compact form.
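
The quintuple can be written down directly as a data structure. The sketch below is one possible encoding, assuming addresses are kept as strings and the protocol as a short label. The `canonical` helper is a hypothetical addition that orders the two endpoints so that packets from both directions of a conversation map to the same key.

```python
from typing import NamedTuple


class FlowKey(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


def canonical(key: FlowKey) -> FlowKey:
    """Return a direction-independent form of the flow key.

    Endpoints are ordered lexicographically; the ordering has no meaning
    beyond making both directions of a flow compare equal.
    """
    first, second = sorted(
        [(key.src_ip, key.src_port), (key.dst_ip, key.dst_port)]
    )
    return FlowKey(first[0], first[1], second[0], second[1], key.protocol)
```

Because `FlowKey` is hashable, canonical keys can be used directly as dictionary keys when clustering packets from a trace.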

2.3.1  Naive  Flow  Membership.  In  the  case  of  TCP/IP,  at  the  packet  level 
of  granularity,  we  can  use  the  TCP  setup  (three-way  handshake)  to  recognize  the 
beginning  of  a  partial  flow  and  TCP  teardown  to  recognize  the  end  of  a  partial  flow. 
Several  studies  have  used  TCP  flows  defined  by  the  SYN/FIN  control  mechanism  in 
TCP  to  denote  flows  [42].  This  representation  is  adequate  if  we  consider  network 
traffic  as  bi-directional  flows  comprised  of  arbitrary  groupings  of  packets  defined  by 
the  attributes  of  their  endpoints.  Our  naive  definition  of  flow  membership  does  not 
address  the  temporal  nature  of  application  protocol  communications  at  the  session 
level.  Also,  it  does  not  account  for  timeout  or  connection  loss. 
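
The naive definition can be sketched in a few lines. This is only an illustration, assuming packets have already been parsed into simple records; the `Packet` type, its field names, and the `naive_flows` function are hypothetical and are not taken from any tool discussed in this chapter. Flows are keyed by a direction-independent five-tuple, opened by a SYN and closed by the first FIN. As noted above, timeouts and connection loss are deliberately not handled.

```python
from collections import namedtuple

# Hypothetical packet record; the field names are illustrative assumptions.
Packet = namedtuple("Packet", "src_ip src_port dst_ip dst_port proto flags")


def flow_key(pkt):
    """Direction-independent five-tuple so both directions share one key."""
    a = (pkt.src_ip, pkt.src_port)
    b = (pkt.dst_ip, pkt.dst_port)
    return (min(a, b), max(a, b), pkt.proto)


def naive_flows(packets):
    """Group packets into flows bounded by SYN (start) and FIN (end)."""
    open_flows = {}   # key -> packets of the flow currently in progress
    finished = []     # completed flows, in order of completion
    for pkt in packets:
        key = flow_key(pkt)
        if "SYN" in pkt.flags and key not in open_flows:
            open_flows[key] = []      # handshake begins a new flow
        if key not in open_flows:
            continue                  # no SYN seen: packet cannot be placed
        open_flows[key].append(pkt)
        if "FIN" in pkt.flags:
            # Simplification: the first FIN closes the flow in both directions.
            finished.append(open_flows.pop(key))
    return finished, open_flows
```

Any flow left in the second return value never saw a FIN, which is exactly the timeout and connection loss case the naive definition cannot classify.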

2.3.2 Accurate Flow Membership. In reality, we will not get enough information
from the five-tuple (source IP, source port, destination IP, destination port,
protocol) to accurately determine a packet's flow membership. The traditional use of
well-known port numbers⁴ to identify application level protocols is in question. In
contemporary network traces, port numbers no longer provide reliable recognition of
application protocol traffic types. Current Internet traffic often ignores established
port number conventions [154]. Statistical techniques have been proposed to accurately
identify the type of application protocol encapsulated in connections. Ma [154]
presents a Markov process model technique and a longest common subsequence approach.
Both techniques provide statistical recognition of the application level contents
encapsulated in transport level connections.

⁴ The Internet Assigned Numbers Authority Port Numbers provides a list of well-known ports
[116].

Figure 2.9: Wireshark Following SMTP Conversation - Wireshark
can reconstruct the conversation between the server and
client using internal knowledge of the SMTP protocol.


Figure 2.9 shows an application level SMTP data flow reconstructed by following
the underlying transport level TCP connection using Wireshark. Wireshark can
reconstruct the conversation between the server and client because it uses internal
knowledge of the structure of the SMTP protocol.


2.3.3 Stateful vs. Stateless. Application data flow membership will also be
impacted by the state characteristics of the protocol under consideration. A stateful
server maintains persistent information about its clients while a stateless server does
not. SMTP and POP3 are stateful while HTTP is stateless. State information is
added to web applications that use HTTP by the use of cookies that encode the
session id and/or session state. If the server is stateless but maintains soft-state, data
that is maintained for the client on the server for a limited time [253, p.91], then we
must have a method that detects early session termination or incomplete sessions, as
is the case when a user navigates away from an HTTP-based web application without
actively terminating the session. This is difficult without a construct that handles
timeout events.

Paxson and Floyd’s study of wide-area TCP arrival processes found that session
arrivals were well modeled by Poisson processes with hourly rates even though individual
connection arrivals were not [195]. Nuzman [186] found that the arrival of HTTP
connections aggregated into sessions also reflects a Poisson process. Kannan [125] used
this observation as a key part of discovering and characterizing causality in network
traffic. Meent [264] uses a 20 second interval for membership in a flow at the packet
level. This means TCP/IP packets identified by the quintuple must be within 20
seconds of each other to be classified as members of the same flow. McGregor also
provides some clustering techniques for classifying flows in [165].
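
The 20 second rule can be illustrated with a short sketch. It assumes the packet timestamps for a single quintuple are already available as seconds; the function name `split_by_gap` is hypothetical, and treating a gap of exactly 20 seconds as still belonging to the same flow is an assumption, since the boundary case is not specified above.

```python
def split_by_gap(timestamps, max_gap=20.0):
    """Split one quintuple's packet times into flows at gaps above max_gap."""
    flows = []
    for t in sorted(timestamps):
        if flows and t - flows[-1][-1] <= max_gap:
            flows[-1].append(t)   # within the interval: same flow
        else:
            flows.append([t])     # first packet, or gap too long: new flow
    return flows
```

For example, packets observed at 0, 5, 24, 50, and 51.5 seconds would be split into two flows, because the 26 second silence between 24 and 50 exceeds the interval.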

2.3.4 Single-connection vs. Multi-connection. Data flow membership will
also  be  impacted  by  how  the  protocol  uses  the  underlying  transport.  SMTP  and 
POP3  session  boundaries  are  easily  detected  because  each  session  is  encapsulated  in  a 
single  TCP  connection  [112,113].  This  means  the  connection  and  session  granularity 
are  equivalent  for  SMTP  and  POP3.  On  the  other  hand,  version  1.1  of  the  HTTP 
specification  allows  for  re-use  of  open  TCP  connections  for  multiple  requests  [80]. 
HTTP  based  web  applications  use  session  identifiers  to  associate  sessionless  HTTP 
requests  into  a  logical  application  session. 

To determine the start and end of a complete flow or session of an application
level protocol we must understand the operators that support session setup and
teardown. Another possibility is to detect session identifiers encoded in the packet traces
of the application layer protocol under consideration.
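
As a rough illustration of grouping by session identifier, the sketch below associates HTTP requests with a logical session through a cookie value. It assumes requests have already been reduced to (request line, Cookie header) pairs; the cookie name `SESSIONID` and the function `group_by_session` are hypothetical, since real applications choose their own identifier names. Cookie parsing uses `http.cookies.SimpleCookie` from the Python standard library.

```python
from collections import defaultdict
from http.cookies import SimpleCookie


def group_by_session(requests, cookie_name="SESSIONID"):
    """Map session id -> request lines, using a cookie as the identifier."""
    sessions = defaultdict(list)
    for request_line, cookie_header in requests:
        cookie = SimpleCookie()
        cookie.load(cookie_header)
        morsel = cookie.get(cookie_name)
        # Requests without the identifier are collected under None.
        session_id = morsel.value if morsel is not None else None
        sessions[session_id].append(request_line)
    return dict(sessions)
```

Requests that carry no identifier end up under `None`; deciding which session they belong to is the harder problem that connection boundaries alone cannot solve.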

2.3.5 Single-channel vs. Multi-channel. Another concern is the number of
connection level channels that make up the communication. Session detection for
single-connection, single-channel protocols, like SMTP or POP3, can be determined
from TCP socket connection status. For more complex multi-channel protocols we
must understand the internal structure of the protocols (e.g. FTP and RPC) to
properly group the packets and connections that make up the application level data
flows [15,260]. A notional multi-channel FTP flow is shown in Figure 2.10, which
shows the control channel and an out-of-band data channel.

Figure 2.10: Flow level breakdown of a simple FTP transfer
- the control channel is on port 1291 and the out-of-band data
channel on port 1292 [15].
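
To make the need for protocol knowledge concrete, the sketch below extracts the data channel endpoint announced on an FTP control channel. In active-mode FTP the client sends `PORT h1,h2,h3,h4,p1,p2`, where the data port is p1 * 256 + p2. The function name `data_endpoints` is hypothetical, and the sketch assumes the control channel has already been reassembled into text lines; passive mode (the reply to `PASV`) is not handled.

```python
import re

# PORT h1,h2,h3,h4,p1,p2 as sent by the client on the control channel.
PORT_RE = re.compile(
    r"^PORT (\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\s*$", re.IGNORECASE
)


def data_endpoints(control_lines):
    """Return (ip, port) pairs announced by PORT commands."""
    endpoints = []
    for line in control_lines:
        match = PORT_RE.match(line)
        if match:
            h1, h2, h3, h4, p1, p2 = (int(g) for g in match.groups())
            endpoints.append((f"{h1}.{h2}.{h3}.{h4}", p1 * 256 + p2))
    return endpoints
```

A connection to an announced endpoint can then be grouped with the control connection that announced it, which is something the quintuple alone cannot reveal.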


2.4  Application  Level  Network  Traces  into  Automata 

We  must  address  the  following  four  issues: 

•  Network  trace  collection. 

•  Application  level  protocol  data  flow  recovery. 

•  Protocol format ($\Sigma$) recovery.

•  Protocol transition function ($\delta$) recovery.

We discussed the issues of trace collection in Section 2.2 and data flow recovery
in Section 2.3. Here we concentrate on format ($\Sigma$) recovery and transition function
($\delta$) recovery from existing traces.


Protocol  format  recognition  is  coupled  with  the  data  portion  of  a  protocol.  As 
discussed  by  Lee  in  [142]  the  data  portion  specifies  functions  that  involve  parameter 
values  associated  with  messages. 


Protocol transition function recognition focuses on the control portion of a
protocol over the data portion [142]. Given two endpoints of a distributed communications
system, a sender $S = (Q_s, \Sigma, \delta_s, q_{s0}, F_s)$ and receiver
$R = (Q_r, \Sigma, \delta_r, q_{r0}, F_r)$, how can we go about recovering a model of
$\delta_s$ and $\delta_r$? What is the minimum knowledge we need a priori to recover the
design using only a captured data flow of the finite members of $\Sigma$?


While we can recover the expected $Q$, $\Sigma$, $\delta$, $q_0$, and $F$ from protocol
specifications, by analysis of source code, or even by reverse engineering binary code,
what if specifications or source code are not available?
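
The five components of an endpoint automaton map directly onto a small data structure. The sketch below is an illustration only: the class name `Automaton`, its field names, and the toy sender are hypothetical, and the sender is a deliberately reduced mail-like command sequence, not a model of the SMTP specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Automaton:
    states: frozenset     # Q
    alphabet: frozenset   # Sigma: the protocol operators
    delta: dict           # transition function: (state, symbol) -> state
    start: str            # q0
    finals: frozenset     # F

    def accepts(self, trace):
        """True if the operator sequence drives q0 into a final state."""
        state = self.start
        for symbol in trace:
            if (state, symbol) not in self.delta:
                return False          # undefined transition: trace rejected
            state = self.delta[(state, symbol)]
        return state in self.finals


# Toy sender, for illustration only.
sender = Automaton(
    states=frozenset({"start", "greeted", "mail", "rcpt", "done"}),
    alphabet=frozenset({"HELO", "MAIL", "RCPT", "DATA", "QUIT"}),
    delta={
        ("start", "HELO"): "greeted",
        ("greeted", "MAIL"): "mail",
        ("mail", "RCPT"): "rcpt",
        ("rcpt", "RCPT"): "rcpt",
        ("rcpt", "DATA"): "greeted",
        ("greeted", "QUIT"): "done",
    },
    start="start",
    finals=frozenset({"done"}),
)
```

In the forward direction the `delta` table is written down from a specification. The recovery problem posed above is the reverse: only the traces are observed, and the table must be inferred.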


2.5  Protocol  Design  Recovery 

We  must  determine  operator  formats  (data  portion)  and  determine  automata 
(control  portion).  This  means  we  must  know  or  infer  the  states,  operators,  transitions, 
initial  state(s),  and  final  state(s)  which  define  the  behavior  of  the  protocol  under 
consideration.  To  frame  the  discussion  we  present  an  overview  of  forward  engineering 
and re-engineering practices. Figure 2.11 shows the inter-relationship of re-engineering
practices.

2.5.1 Forward Engineering. Forward engineering is the traditional engineering
process of moving from high-level abstractions to the physical implementation of a
system [40,214]. Protocol engineering is an interdisciplinary approach which emphasizes
the use of sound engineering principles and formal methods to develop reliable
communication software [230]. Protocol forward engineering efforts can utilize formal
techniques for analysis and modeling but many protocols are designed and implemented
using informal approaches [109,231]. Even widely used protocols like HTTP,
SMTP, and POP3 rely on English language descriptions of correct control flow.

Figure 2.11: Relationship of Re-engineering Practices - Reverse
engineering attempts to recover higher levels of abstraction,
restructuring modifies system artifacts at the same level of
abstraction and re-engineering uses reverse engineered artifacts
to generate a new physical implementation [214].


2.5.2 Reverse Engineering. Reverse engineering is the practice of discovering
the technological principles of a device/object or system through analysis of its
structure, function, and operation [40,108,214]. Software reverse engineering concentrates
on analysis of software through disassembly and debugging of a software program.
In some sense the inverse of forward engineering, software reverse engineering is also
referred to as reverse code engineering (RCE).

2.5.3 Protocol Reverse Engineering. Protocol reverse engineering is the
application of reverse engineering, often including RCE, to recover the automata and
operators which define the protocol. Protocol reverse engineering can be tightly
coupled with RCE. If the protocol specification is not available, RCE of source code can
provide many clues to the structure of the protocol. RCE can also be applied to the
binary programs that implement a protocol to approximate the original design/engineering
decisions. Protocol reverse engineering without access to specifications or
source code can be significantly more challenging.

2.5.3.1 Static vs. Dynamic. Static and dynamic protocol reverse
engineering (see Saleh [230]) are differentiated by information source. Static protocol
reverse engineering is the recovery of protocol information from specification
documents and implementation artifacts including system documentation, source code, or
even reverse engineered binary code. The process is static from the perspective of the
protocol under inspection, even though RCE of binary code will likely involve dynamic
runtime analysis at the local level. Lie proposed creating models from protocol code using
an extensible compiler system and applied the system to analyze cache coherence
protocols in multi-processor systems [149].

Dynamic protocol reverse engineering is the recovery of protocol information
from observations of the system in action. If the protocol is part of a layered
architecture (such as the TCP/IP protocol suite which implements in part the OSI reference
model) the traces may be collected at an observation point established at a layer
below the protocol or a layer above the protocol. An observation point above the
protocol's layer should collect events that are caused by service requests from layers
above the protocol under observation.

Much like putting a multi-meter or oscilloscope at the inputs (lower service
access point) and outputs (upper service access point) of an electronic circuit to
determine its internal operation by collecting traces of inputs and outputs, we will
collect packets from the lower service access point (LSAP) and upper service access
point (USAP) of the protocol automaton under consideration. Then we can attempt
to infer the internal operation of the protocol automaton or construct an equivalent
automaton.


2.5.3.2 Who needs it? Distributed systems rely on the correctness
of both open and proprietary protocols to provide functionality. As an example,
Microsoft’s proprietary Server Message Block/Common Internet File System (SMB/CIFS)
is often encapsulated in TCP connections [105]. Deep understanding of the
protocols which underpin distributed systems is increasingly important to computer
security efforts [100,135,250]. Protocol reverse engineering is often the only option
available to develop an understanding of proprietary protocols that allows us to
validate that the protocol implementation is correct, reliable and secure.

Protocol reverse engineering has been proposed for a range of purposes including:
conformance testing [143]; design recovery [231]; to develop interoperable interfaces
between incompatible protocols [194,248]; as a means to enhance network security
analysis [100]; and to develop signatures for network intrusion detection systems [178].
Despite these practical uses the practice is hindered by legal obstacles designed to
thwart theft of trade secrets and by a perceived, and perhaps deserved, lack of legitimacy
[111,199].

2.6 State-of-practice

State-of-practice reverse engineering tools and techniques for file formats and
binary executables are presented through several hacker-oriented books and websites
(e.g. [72,75,87,98,132]). Collection and observation of TCP/IP network traffic is
also well covered (e.g. [48,84,98,190,242,251,279]). There are a few studies (e.g.
[82,100,143,229]) and web articles [210] that discuss protocol reverse engineering.
The topic, possibly due to its somewhat underground nature, has not received broad
academic treatment.

Fortunately, the state-of-practice is documented by open source projects that
apply protocol reverse engineering to re-implement proprietary protocols. Three
selected open source projects that use protocol reverse engineering methods are Pidgin,
Samba, and Rdesktop [38,50,258]. Because the protocol specifications are not
completely available, all three projects rely on reverse engineering.
2.6.1 Tools. The basic toolset capability required for TCP/IP protocol
reverse engineering is the ability to capture/record, manipulate, and analyze TCP/IP
packets. There is a wide range of open source, public domain, and commercial tools
that provide capabilities that are useful for protocol reverse engineering efforts⁵.
Sonnenburg et al. [245] argue that open source machine learning is key to supporting
experimental reproducibility. Joyner and Stein also argue the value of open source
software to mathematical studies in an opinion piece [122]. In the interests of
reproducibility we examined tools that are openly available either as open source or public
domain. Below we discuss several open source tools that can assist with TCP/IP
protocol reverse engineering within the context of our four reverse engineering problems:
trace collection, data flow recovery, format recovery, and transition function recovery.


2.6.1.1 Trace Collection. One commonly referred-to tool is Wireshark,
shown in Figure 2.12 [48]. Wireshark, formerly called Ethereal, is an open source
packet capture program. It includes dissector algorithms which recognize and parse
many text and binary application level protocols [190, p.79]. Another often-mentioned
tool for packet capture is tcpdump [51]. Packet data flows are stored in several file
formats. Because Wireshark and tcpdump both use pcap (libpcap under UNIX and
the WinPcap library under Windows), their file formats are compatible. WinPcap on
Windows and libpcap on UNIX provide similar application programming interfaces
(API) and are commonly referred to as the pcap API. The tcpdump website that
hosts libpcap provides links to several related utilities and projects built with the
pcap API.


In the assessment of Kreibich, author of NetDude, the trifecta of Pcap, tcpdump,
and Wireshark forms the de facto standard tool set for TCP/IP protocol reverse
engineering [133].


⁵ Lists of tools (e.g. [56] and [51, Related tools]) are readily available via the Internet.




2.6.1.2 Data Flow Recovery. There are tools that can reconstruct
connection level traces from packet traces. For example, NetDude [133] includes a
demux (de-multiplexer) plugin that breaks packet traces into sub-traces along the
TCP connection boundaries. The sub-traces are stored in pcap compatible trace files.
Regrettably, the sub-traces do not contain the complete communication between the
server and client. Bhargavan proposes the development of Network Event Recognition
Language (NERL) [26,27] to reconstruct application level protocols for monitoring
purposes. A NERL implementation is described in [25], including analysis of SMTP,
but a reference implementation is not made publicly available.

There are some tools to reconstruct session level traces from packet traces.
Chaosreader [34] and tcpflow [73] both recreate session flows from packet traces.
Unfortunately, the output from both tools is formatted for human review, not automated
processing.


2.6.1.3 Format and Transition Recovery. We discovered no publicly
available tools that provide automated format or transition function recovery from
network traces. This is further discussed in Section 2.8.

2.6.1.4 Miscellaneous Tools. The pcap API is used by a wide range of
utilities  and  tools  to  generate  statistics  from  packet  traces.  Automated  support  for 
packet  level  analysis  is  available  from  Tstat  [223]  and  tcptrace  [191].  Several  tools  add 
features  to  support  online  and  live  packet  manipulation  and  replay  of  captured  data. 
Flowreplay  [260],  based  on  Turner’s  earlier  work  on  tcpreplay  for  datalink  replay, 
was  designed  to  replay  TCP/IP  traffic  at  the  transport  and  application  layer.  Scapy 
is  an  interactive  packet  manipulation  program  [29].  Paxson  created  tcpanaly  [195] 
to  automatically  analyze  TCP  implementations  at  the  packet  level  of  granularity. 
The  tool  is  not  publicly  available.  Other  tools  are  focused  on  filtering  traffic  from 
real  time  collections  for  direct  observation  by  a  human  operator.  Examples  include: 
Trafshow  [272],  Ngrep  [217]  and  Ntop  [64]. 
Figure 2.12: Wireshark - Ready for “test, capture, and stare”.


2.6.1.5 Programming Toolkits. Finally, there are several toolkits,
besides Pcap, which provide APIs to ease the development of network tools. The libnet
library [234] is a toolkit allowing the construction and injection of packets. Another
toolkit is libdnet which supports low level network operations for several languages (C,
C++, Python, Perl and Ruby) on many UNIX variants and Microsoft Windows [244].
The libevent toolkit provides an event oriented API. The event abstraction allows
developers a design alternative to polling loops and threads when processing network
traces [160]. Libevent supports per event timers with callbacks on timeout. Another
library is libnids which provides a Linux version 2.0 TCP/IP protocol stack in user
space [275]. The libnids toolkit uses libpcap and libnet internally to provide IP
fragment reassembly and IP stream reconstruction [275]. The libnet, libdnet, libevent,
and libnids toolkits are accessible from scripting languages, allowing rapid ad hoc
prototyping.
2.6.2 Techniques. Although protocol reverse engineering involves tools, the
key is the reverser's ability to understand the assumptions and design decisions of
the people who created the protocol specification or implemented the protocol, and
then undermine them. Reverse engineering requires in-depth knowledge of myriad
technical specifications and specific implementations coupled with an understanding
of the engineering decisions of the original designers.

It should be noted that the projects used both online and offline techniques.

One protocol reverse engineering technique used by contributors to all three
projects is described, somewhat lightheartedly, as test, capture, and stare [258]. The
technique is an ad hoc approach that depends on fast turnaround of simple tests that
involve capturing network traffic resulting from varying parameters in operators. Test,
capture, and stare informally defines the state-of-practice for protocol model reverse
engineering from captured data flows.


Two other prominent techniques are protocol filters and protocol specific
scanners. A protocol filter is a Man in the Middle (MITM) proxy server that can make
changes to protocol data before it is passed on to a target server. MITM proxy
servers, as shown in Figure 2.13, use session hijacking techniques such as ARP
poisoning [98, p.215] or DNS poisoning [98, p.216] to re-direct traffic from the original
communication between a client and server on network path A. When the MITM
proxy server is active the client communicates with the MITM proxy on path B
and the server communicates with the MITM proxy on network path C. Filters in
the MITM proxy allow specific or global substitutions of packet contents. A MITM
proxy server based protocol filter can be essential in online (or live) reverse
engineering [82, p.9].


A protocol scanner uses known signpost values, such as error codes, to guide
exploration of protocol structure. Scanners can be used to find new parts of a protocol
or to determine some properties of a protocol operation [258]. One challenge of
developing protocol scanners is recognizing the exact meaning of known signpost
values and how the proprietary parser responds to those values [258].

Figure 2.13: MITM Attack Topology [82, Figure 2.1].

2.7 Case Studies

Here we present an overview of protocol reverse engineering as used in three
selected open source projects that rely on protocol reverse engineering methods:
Pidgin [50], Samba [258], and Rdesktop [38]. While there are other projects that apply
protocol reverse engineering, we find that the selected projects are significant because
they are concentrated community driven efforts that openly present the tools and
techniques used. In fact, the results of each project are available as open source.

2.7.1 Pidgin. Pidgin is a multi-protocol Instant Messaging (IM) client that
supports the use of proprietary IM servers [50]. Pidgin's plug-in architecture allows
independent efforts towards reverse engineering and re-implementing proprietary IM
protocols. It is difficult to characterize the overall protocol reverse engineering
approach used by Pidgin contributors, in part because IM protocol specific plug-ins
are developed independently. Although protocol filters are available for IM protocols,
it is not clear if they were used by plug-in authors [2]. In an e-mail conversation,
the implementer of one protocol mentioned that Wireshark was critical. While the
client fully implements open IM standards such as the Extensible Messaging and
Presence Protocol (XMPP), there are still communications issues between the proprietary
IM clients and proprietary servers. One example is that file transfers using the
Windows Live Messenger compatible⁶ .NET protocol have not yet implemented faster
peer-to-peer functionality [241]. Pidgin plug-ins for proprietary IM protocols, such as
Windows Live and Yahoo, must be patched when the protocol is changed. The delay
between protocol changes and working client software caused by reverse engineering
can be months.


2.7.2 Samba. Samba is an open source implementation of the closed source
proprietary Microsoft Windows SMB/CIFS implementation. The Samba project,
unlike Pidgin, must support several different session and application level protocols to
implement SMB/CIFS functionality. The project has taken over 12 years to manually
reverse engineer SMB/CIFS [58,258]. The effort has successfully achieved
interoperability with Microsoft Windows file and print sharing services. It has also
implemented Windows NT domain controller services and the project plans to implement
Active Directory capabilities [105,258]. Samba is the basis of Microsoft Windows
network interoperability for many UNIX based systems including Apple's Mac OS X
since version 10.1 (shown in Figure 2.14) [63]. Recently, the project established the
Protocol Freedom Information Foundation (PFIF) to acquire protocol documentation
Microsoft made available due to European Union court decisions [259].


Samba is unique in that the effort has been guided by consistent leadership.
Andrew Tridgell, the initiator of the project, much like Linus Torvalds for the Linux
kernel or Richard Stallman for GNU projects, serves as the public representative of
Samba. Tridgell proposed the French Cafe Technique, also called network analysis, as
the project's overarching approach to reversing the SMB/CIFS protocol family [258].
Samba contributors use a range of techniques and tools beyond test, capture, and
stare to support their reverse engineering efforts. Samba contributors developed an
SMB protocol scanner called trans2. It includes dozens of sub-commands, what we
term operators, which implement various types of file and file system queries. The
scanner tries different information levels, data sizes, and object types to determine
what operations exist and what sizes of data are involved [258]. The project also
documents informal protocol reverse engineering techniques that are not detailed in
the other projects. Two of the techniques are trial and error analysis and dual server
and backtracking.

⁶ Windows Live Messenger http://get.live.com/messenger/overview

Figure 2.14: SAMBA on Macintosh OS X 10 [63].


2.7.2.1 Trial and Error Analysis. Complex protocols tend to have
many error values [258]. To determine what each error code means, Tridgell
recommends writing an error driven protocol scanner [258]. Tridgell also recommends an
error mapping approach. As an example, when targeting a file sharing protocol he
recommends modifying the server to return error XXX for filename 'test_XXX.dat', then
asking the proprietary client to access filenames from test_001.dat to test_999.dat.
Finally, the message returned to the client must be collected for analysis. Proxy error
mapping is a technique invented by Andrew Bartlett to discover the correct mapping
between Microsoft Disk Operating System (MS-DOS) error codes and Windows NT
status codes [258]. Proxy error mapping extends trial and error analysis by inserting
a protocol proxy between the server and client. The proxy, which contains error codes
known from MS-DOS clients (signpost values), performs the same operations
that returned the MS-DOS error codes against reference Windows NT servers. Proxy
error mapping works in this case due to a priori knowledge of the MS-DOS error
codes.
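
The filename-driven error mapping can be sketched as follows. This is a hypothetical illustration and not Samba code: `error_for` stands in for the rule added to the modified server, `probe_names` generates the filenames the proprietary client is asked to access, and `observe_client` is an assumed callable that drives the client and returns whatever message it displays.

```python
import re

NAME_RE = re.compile(r"^test_(\d{3})\.dat$")


def error_for(filename):
    """Error code the modified server returns, or None for other names."""
    match = NAME_RE.match(filename)
    return int(match.group(1)) if match else None


def probe_names(first=1, last=999):
    """Filenames the proprietary client is asked to access, in order."""
    return [f"test_{code:03d}.dat" for code in range(first, last + 1)]


def build_error_map(observe_client):
    """Record what the client reports for each error code returned."""
    return {error_for(name): observe_client(name) for name in probe_names()}
```

The resulting dictionary maps each numeric error code to the client's observable reaction, which is the raw material for the analysis step described above.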


2. 7.2. 2  Dual  Server  and  Backtracking.  The  dual  server  technique 
can  be  used  to  fine  tune  understanding  of  a  protocol.  The  basic  concept  is  to  write  a 
client  that  connects  to  a  reference  server  in  parallel  with  the  reverse  engineered  server, 
then  to  systematically  generate  protocol  operations  and  finally  compare  the  results 
from  both  servers  [258].  One  problem  with  this  technique  is  the  protocol  may  have 
temporal  dependencies  between  current  and  past  operations.  It  is  possible  that  both 
servers  will  process  several,  maybe  even  thousands,  of  operations  before  generating 
an  error  condition.  For  this  reason  the  operators  sent  to  the  dual  servers  should  be 
recorded  for  replay.  Suspected  erroneous  operations  can  be  removed  from  the  replay 
and  flagged  for  further  consideration  if  the  error  condition  is  not  returned  by  the 
reference  server.  The  process  of  removing  suspect  operators  and  replaying  is  also 
referred  to  as  differential  analysis  [258]. 
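The record, replay, and remove cycle described above can be sketched as follows. This is one possible reading of the technique, assuming both servers can be restarted in a clean state for every replay; the two toy servers and the function names are hypothetical and exist only to make the sketch runnable.

```python
from typing import Callable, List, Optional, Sequence

Handler = Callable[[str], str]          # stateful: operation -> result
ServerFactory = Callable[[], Handler]   # returns a fresh server instance


def reference_server() -> Handler:
    state = {"open": False}

    def handle(op: str) -> str:
        if op == "open":
            state["open"] = True
            return "ok"
        if op == "close":
            state["open"] = False
            return "ok"
        if op == "read":
            return "data" if state["open"] else "error"
        return "ok"                      # e.g. "noop"
    return handle


def reimplemented_server() -> Handler:
    state = {"open": False}

    def handle(op: str) -> str:
        if op == "open":
            state["open"] = True
            return "ok"
        if op == "close":
            return "ok"                  # seeded bug: open flag is never cleared
        if op == "read":
            return "data" if state["open"] else "error"
        return "ok"
    return handle


def first_divergence(ops: Sequence[str], ref: ServerFactory,
                     test: ServerFactory) -> Optional[int]:
    """Replay ops on fresh servers; return the index of the first mismatch."""
    ref_handle, test_handle = ref(), test()
    for i, op in enumerate(ops):
        if ref_handle(op) != test_handle(op):
            return i
    return None


def reduce_replay(ops: List[str], ref: ServerFactory,
                  test: ServerFactory) -> List[str]:
    """Drop recorded operations one at a time while the mismatch persists."""
    kept = list(ops)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if first_divergence(candidate, ref, test) is not None:
            kept = candidate             # operation i was not needed
        else:
            i += 1                       # operation i matters; keep it
    return kept


recorded = ["noop", "open", "read", "noop", "close", "noop", "read"]
suspects = reduce_replay(recorded, reference_server, reimplemented_server)
```

The operations that survive the reduction are the ones to flag for further consideration, since removing any of them makes the two servers agree again.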

2.7.3 Rdesktop. The rdesktop project, initiated by Matthew Chapman,
implemented an open source client for Windows NT Terminal Server and Windows
2000/2003 Terminal Services [38]. It supports the Microsoft Remote Desktop Protocol
(RDP) on several UNIX-based platforms with the X Window System (see Figure 2.15).
RDP is an extension of the International Telecommunication Union Telecommunication
Standardization Sector (ITU-T) T.128 multipoint application sharing protocol [38,170].
This project, unlike Pidgin and Samba, was focused on a single protocol. Even though
Microsoft's proprietary implementation of RDP is partially specified in
ITU-T T.128, the project required significant effort because the protocol traffic is
encrypted. Like Samba, the project utilized the test, capture, and stare approach to
incrementally build a working RDP implementation [52,103]. Also like Samba, project
contributors used protocol filters to conduct MITM attacks that revealed specifics of
Microsoft's RDP protocol [82]. The project implemented rdpproxy, an RDP-specific
protocol filter, to perform advanced processing of RDP data flows [82].

Figure 2.15: Rdesktop Connection.


2.8  State-of-Art 


The contributors to the projects discussed in Section 2.7 were highly dependent
on variations of test, capture, and stare. While the Samba project presents additional
ad hoc methods such as specialized protocol scanners, proxy servers, and partial
implementations of reference servers, each method depends on the knowledge incrementally
encoded in their re-implementation of SMB/CIFS. Even after multi-year efforts, the
state-of-practice for protocol reverse engineering has not advanced beyond painstaking,
time-intensive, manual scrutiny of packet captures using tools like tcpdump and
Wireshark.
As discussed in Sections 2.1.1 and 2.1.2, the key data structures and data
relationships we must understand are the protocol automata and operator formats. At its
heart, protocol reverse engineering involves inferring the packet format of operators
and the structure of the protocol automata. The most obvious approach is of course
to access the protocol specification if it is available. Another alternative is to
review the source code for the protocol implementation. Unfortunately, many protocol
implementations do not release specifications or source code for public review.


While  it  is  possible  to  apply  RCE  techniques  to  recover  an  approximation  of  the 
protocol  source  design  we  choose  not  to  examine  this  alternative,  instead  concentrating 
on  model  recovery  from  network  traces.  Regardless,  we  acknowledge  that  a  robust 
protocol  reverse  engineering  effort  has  much  to  gain  from  RCE  of  the  programs  that 
implement  the  protocol  under  consideration.  In  fact,  RCE  of  source  code  or  binary 
code  may  be  the  only  option  if  a  specification  or  other  documentation  is  not  available. 


Furthermore, it should be kept in mind that recovery of operators or automata is
limited by the completeness of the captured data flows. If the captured data does not
include a complete sample of the operators used by the protocol, we will not be able
to completely construct all the operators or an accurate automaton. Let us examine
current research efforts for protocol format recovery and protocol automata recovery
separately.


2.8.1  Protocol  Format  Recovery.  Building  a  dictionary  of  the  operators  an 
application  protocol  implements  requires  us  to  understand  the  format  of  the  payload 
data  encapsulated  in  TCP  traffic. 

An  approach  drawing  from  bio-informatics  research  is  proposed  by  the  PROTOS 
protocol genome project [104]. The project intends to utilize automated structure
inference  techniques  for  the  purpose  of  developing  automated  testing  tools.  To  date 
the  project  has  provided  only  notional  results. 

A similar concept that appears when searching on protocol structure inference
is Protocol Informatics. Protocol Informatics was introduced by a
security analyst named Marshall Beddoe in 2004. The only remaining evidence of the
effort is a Python^7 implementation of some of the concepts, available on the Internet
at [18].

Another  initiative  called  Discoverer,  from  Microsoft  Research,  uses  machine 
learning  techniques  including  clustering  to  infer  protocol  packet  format  idioms  [58]. 
The  authors  evaluated  their  approach  over  HTTP,  Remote  Procedure  Call  (RPC) 
and  SMB/CIFS.  The  authors  focused  on  the  correctness,  conciseness  and  coverage  of 
their  format  inference  leaving  automata  inference  for  future  work.  The  inferred  packet 
formats  covered  over  95  percent  of  their  captured  data  flow  traffic  [58].  Unfortunately, 
the  algorithms  and  data  sets  used  in  their  analysis  have  not  been  released  to  the  public. 

Borisov  et  al  describe  their  Generic  Application-Level  Protocol  Analyzer  (GAPA) 
in  [32].  The  program  implements  a  protocol  specification  language  (GAPAL)  using  a 
syntax  format  similar  to  Augmented  BNF.  GAPAL  is  used  to  prototype  application 
protocols  and  supports  modeling  of  the  underlying  protocol  state  machine.  While 
the  authors  mention  that  the  tool  can  potentially  enable  the  automatic  generation 
of  vulnerability  signatures  they  do  not  implement  any  automated  inference  of  the 
underlying  protocol  format  or  protocol  automata.  Also,  the  GAPA  implementation 
and  GAPAL  specification  are  not  publicly  available. 

Recently, Fisher et al proposed automated inference from ad hoc data to generate
descriptions in the PADS data description language [81]. A generic structure discovery
algorithm is presented in pseudo-ML in [81, Figure 5]. While source for the inference
algorithm is not publicly released, an implementation is available through the PADS
project web site^8.


2.8.2 Recovering Automata. Automated recovery of protocol models as
different types of automata has been proposed in various forms throughout the last
decade. Message Sequence Charts and Communicating Finite State Machines are two
representations that have been proposed for model recovery by automata synthesis and
automata inference from protocol execution traces. Synthesis differs from inference
in that synthesis uses a complete sample to construct the target automaton. An
inference procedure might not have a complete, or even characteristic, sample to
generate a hypothesis automaton.

7 Python Programming Language - http://www.python.org/

8 PADS project - http://www.padsproj.org


2.8.2.1 Message Sequence Charts. The use of Message Sequence
Charts (MSC) for communications systems is discussed in detail in ITU-T Z.120 [117].
Alur provides an algorithm for synthesis of MSC and establishes the foundational
theory for MSC inference. In Design Recovery from Observations [261], Ural et al propose
recovery of protocol designs by analysis to build MSC. Their approach recovers a
lattice of repetitive sub-functions from a series of observations [121]. After recovery the
lattice is manipulated to synthesize an MSC model. While Ural et al implemented the
synthesis algorithm in C++, they did not implement a trace collection or processing
architecture, instead using generated text files as input to their system [261, Section
4]. If we choose MSC to model recovered protocol designs, then the algorithms
presented could serve as an analytical backend for protocol performance properties. The
source code and executables of their implementation are not available to the public.


Another effort that partially solves the problem of protocol model recovery is
Synthesizing Models by Learning from Examples (Smyle)^9. MSC inference is applied to
conformance testing by [31] in the Smyle system. MSC are used as inputs to the model
synthesis system. Smyle uses inference learning from MSC to develop a
message-passing automata (MPA) model [31, Definition 3]. Smyle can synthesize a model
from a given labeled scenario (MSC samples marked as positive or negative). The
system uses an extension of Angluin's L* algorithm to support MSC using a LearnLib
(see Section B.3.1) based inference mechanism [31, Section 4]. Unfortunately, Smyle
requires manual creation and labeling of the positive and negative samples. Like
Ural et al, Smyle does not incorporate a trace collection or processing architecture
but could also serve as an analytical backend to check performance properties. Smyle
executables are available for research purposes by request. Source code is not made
available at this time due to third party involvement.

9 Smyle - http://smyle.in.tum.de/

Finally,  the  company  Event  Helix  provides  a  modified  version  of  Wireshark 
to  synthesize  MSC  like  graphs  from  packet  traffic  at  the  packet  level  of  granularity 
[77].  The  product  does  not  support  synthesis  or  inference  of  application  level  session 
models. 


2.8.2.2 Communicating Finite State Machines. While synthesis of
Communicating Finite State Machines (CFSM) from execution traces has been studied,
we did not find any attempts at CFSM inference from traces. Saleh [231] proposes
a semi-automatic approach to reverse engineering a communications protocol that
can synthesize a CFSM model of a protocol's automata from execution traces. A
network of CFSM consists of a set of FSM which communicate asynchronously over
FIFO channels by sending and receiving typed messages [39,196]. Each protocol
entity is represented by a CFSM, with error-free simplex channels represented by
unbounded FIFO queues [39]. The CFSM representation is useful for our problem domain
because CFSM can be checked for non-progress properties by reachability analysis and
reverse reachability analysis [196]. Although CFSM synthesis algorithms are presented
by [231], we did not discover any systems that implement CFSM synthesis or inference.
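To make the CFSM structure concrete, the following minimal sketch simulates two finite state machines that exchange typed messages over two simplex FIFO channels. It is not taken from [231] or [39]; the machine tables, state names, and the `simulate` helper are assumptions made for illustration only.

```python
from collections import deque

# Each machine maps a state to a list of (action, message, next_state).
# "!" sends on the outgoing channel; "?" receives from the incoming channel.
SENDER = {
    "idle": [("!", "req", "wait")],
    "wait": [("?", "ack", "done")],
    "done": [],
}
RECEIVER = {
    "listen": [("?", "req", "reply")],
    "reply": [("!", "ack", "listen")],
}


def step(machine, state, inbox, outbox):
    """Fire the first enabled transition; return the new state or None."""
    for action, message, next_state in machine[state]:
        if action == "!":
            outbox.append(message)
            return next_state
        if action == "?" and inbox and inbox[0] == message:
            inbox.popleft()
            return next_state
    return None                          # blocked: nothing enabled


def simulate(max_steps=20):
    """Run both machines until neither can move; return the final configuration."""
    s2r, r2s = deque(), deque()          # unbounded FIFO channels
    s_state, r_state = "idle", "listen"
    trace = []
    for _ in range(max_steps):
        progressed = False
        nxt = step(SENDER, s_state, r2s, s2r)
        if nxt is not None:
            trace.append(("sender", s_state, nxt))
            s_state, progressed = nxt, True
        nxt = step(RECEIVER, r_state, s2r, r2s)
        if nxt is not None:
            trace.append(("receiver", r_state, nxt))
            r_state, progressed = nxt, True
        if not progressed:
            break
    return s_state, r_state, list(s2r), list(r2s), trace


final = simulate()
```

A reachability analysis would explore every interleaving of enabled transitions rather than the single schedule used here; the sketch only shows the global configuration (local states plus channel contents) that such an analysis enumerates.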

2.8.2.3 n-Gram and Word Models. The n-gram and word model
techniques presented by Rieck and Konrad are focused on anomaly detection for intrusion
detection purposes [216]. In [216] an incoming connection payload $x$ corresponds to a
consecutive sequence of symbols from an alphabet $\Sigma$. The content of $x$ can be
modeled as a set of subsequences $w$ taken from a language $L \subseteq \Sigma^*$. The length of $w$
is denoted by $n$. The model of n-grams can be derived by defining $L = \Sigma^n$; that is, $L$ is
the language containing all sequences of fixed length $n$. Provided a set of delimiter
symbols $D \subset \Sigma$, the model of words is defined as $L = (\Sigma \setminus D)^*$, where every
subsequence $w \in L$ of $x$ is delimited by symbols from $D$. The chosen language $L$ constitutes
the basis for calculating similarity between network connections.
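Both language models can be sketched in a few lines. The functions below are illustrative and are not the implementation from [216]; the names `ngrams` and `words`, the sample payload, and the choice of delimiters are assumptions.

```python
from collections import Counter


def ngrams(payload: bytes, n: int) -> Counter:
    """Count every subsequence of fixed length n (the language Sigma^n)."""
    return Counter(payload[i:i + n] for i in range(len(payload) - n + 1))


def words(payload: bytes, delimiters: bytes) -> Counter:
    """Count maximal runs of non-delimiter bytes (the word model).

    Empty words between adjacent delimiters are skipped in this sketch.
    """
    counts = Counter()
    current = bytearray()
    for byte in payload:
        if byte in delimiters:
            if current:
                counts[bytes(current)] += 1
                current.clear()
        else:
            current.append(byte)
    if current:
        counts[bytes(current)] += 1
    return counts


payload = b"GET /index.html HTTP/1.1\r\n"
trigrams = ngrams(payload, 3)
tokens = words(payload, b" \r\n")
```

Either counter can be treated as a sparse feature vector, so two connections can be compared with any vector similarity measure.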


This could be a useful means of converting an input stream into a vector of
values which can be used as a basis of comparison in machine learning techniques
such as the kernel methods used by Clark [45,46] (see Section 3.3.2.2).

The n-gram approach strongly parallels grammatical inference techniques. In
fact, stochastic k-TS models are equivalent to n-grams, with n = k [271, Fact 1].
Furthermore, n-gram models have been combined with GI techniques such as MGGI [271]
using k-TS representation and restricted k-TSS automata [268] (see Section 3.10.4.1).


2.8.2.4 Other Approaches. Communicating-X machines [16,17,128]
are another possible formal representation, but they have not received wide treatment
with respect to formal performance analysis. We did not discover any attempts at
recovery of models as Communicating-X machines. Another formal representation,
the Event-Driven Extended Finite State Machine (EEFSM), is presented in [142]. Lee
formally associates the data portion in EEFSM with variables and parameters [142].
The EEFSM construct is used to develop passive testing algorithms for the OSPF
and TCP state machines [142].

One practical application is ScriptGen, an automated script generation
tool for the Honeyd virtual honeypot^10 [147]. It is designed to monitor, capture, and
analyze packets used by unknown protocols, then generate scripts for replay in a
Honeyd honeypot environment. Unfortunately, the authors do not reveal the particulars
of their implementation.


2.9  Chapter  Summary 

In this chapter we provided an English language description of the problem
domain under consideration. Next we discussed distributed systems and issues related
to protocol design recovery. Finally, we introduced protocol reverse engineering and
overviewed state-of-practice and state-of-art reverse engineering tools and techniques.

10 Honeyd Virtual Honeypot - http://www.honeyd.org/
III. Algorithm Domain

This  chapter  relates  the  problem  domain  of  dynamic  protocol  reverse  engineering 
from network traces to the algorithm domain of grammatical inference. We introduce
the Chomsky Hierarchy as a framework for discussing computational learnability.
Next,  we  develop  the  symbolic  model  and  mathematical  notation  that  succinctly  de¬ 
fines  the  characteristics  of  the  algorithm  domain.  Finally,  we  discuss  several  existing 
algorithmic  and  heuristic  approaches  to  grammatical  inference. 


3.1  Design  Recovery  from  Samples  of  Behavior 

Design recovery from samples of behavior has been studied for several purposes,
including automated specification mining of instrumented software executables [6] and Java
object behavior mining [59]. Process discovery from samples of behavior (in event logs)
is also studied for discovery of software process models [53] and workflow discovery
[202,235].


3.2  A  Language  Recognition  Problem? 


We refer back to our investigative questions presented in Section 1.5:


IQ1  What  information  is  necessary  to  reverse  engineer  the  control  portion  of  appli¬ 
cation  layer  protocols  from  data  flows? 

IQ2  Given  the  proven  [7, 95]  difficulty  of  inferring  finite  automata  from  positive  sam¬ 
ples  only,  are  there  GI  approaches  that  are  appropriate  for  reverse  engineering 
automata  representations  of  the  control  portion  of  application  layer  protocols 
from  data  flows? 

Bhargavan [24] formulates the problem of monitoring interactive devices like
network protocols as a language recognition problem. The authors propose that,
given a specification that accepts a certain language of input-output sequences, we can
define another language that corresponds to the externally observable input-output
sequences [24]. In essence, can we recover the model of a protocol given examples
of its behavior? Or more specifically, can we algorithmically turn application level
network traces into automata?

With  this  background  in  mind  we  present  aspects  of  formal  language  theory 
that  frame  the  discussion  of  the  algorithm  domain  under  consideration. 


3.3 Formal Languages

While  protocol  forward  engineering  efforts  can  utilize  a  range  of  formal  models 
to  represent  the  protocol  under  design  they  often  do  not  [231].  We  present  aspects 
of  formal  languages  to  frame  the  discussion  of  model  recovery  for  reverse  engineering 
purposes. 


3.3.1 Chomsky Hierarchy. The Chomsky Hierarchy, devised by Noam
Chomsky and presented in Table 3.1 and Figure 3.1, is a widely accepted framework
for the discussion of formal representation of grammars and their expressive power.
Informally, a sentence is a string of symbols, a language is a set of sentences, and a
grammar is a (finite) list of rules defining the language [184].

Three  basic  decision  problems  for  the  representations  are  the  questions  of  [3,110]: 


membership  -  is  a  given  sample  sentence  a  member  of  a  language? 
equivalence  -  are  two  grammars  able  to  recognize  the  same  language? 
emptiness  -  is  the  representation  empty? 

Angluin  further  expands  this  list  to  include  the  decision  problems  of  [11,  Section  1]: 

subset  -  are  all  elements  of  a  given  language  also  members  of  another  language? 
superset  -  does  a  given  language  contain  all  elements  of  another  language? 
disjointness  -  do  two  languages  have  no  element  in  common? 

exhaustiveness  -  is  there  a  member  of  a  given  sample  that  is  disjoint  from  the 
possible  languages? 
The membership problem is undecidable for Type-0, decidable for Type-1, solvable
in polynomial time for Type-2, and solvable in linear time for Type-3. The equivalence
problem is undecidable for Type-0, Type-1, and Type-2. For Type-3, equivalence can be
decided in polynomial time only when the representation is a DFA^1. Even with these
restrictions grammars are useful for studying languages. Grammars give a compact
representation that supports recursivity. Also, grammars support graphical representations
such as automata and parse trees (see Figure 3.2). Finally, even the "easiest" class,
Type-3, contains SAT, boolean functions, and parity functions [254].
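For the Type-3 case with a DFA representation, the membership, emptiness, and equivalence questions can each be answered by a short graph search. The sketch below is a minimal illustration, assuming states are never named `None` and that both automata share an alphabet; the class and function names are hypothetical.

```python
from collections import deque


class DFA:
    """Deterministic finite automaton with a partial transition function."""

    def __init__(self, alphabet, delta, start, accepting):
        self.alphabet = set(alphabet)
        self.delta = dict(delta)          # (state, symbol) -> state
        self.start = start
        self.accepting = set(accepting)

    def accepts(self, word):
        """Membership: one table lookup per symbol."""
        state = self.start
        for symbol in word:
            state = self.delta.get((state, symbol))
            if state is None:             # undefined transition: reject
                return False
        return state in self.accepting


def is_empty(dfa):
    """Emptiness: true when no accepting state is reachable."""
    seen, queue = {dfa.start}, deque([dfa.start])
    while queue:
        state = queue.popleft()
        if state in dfa.accepting:
            return False
        for symbol in dfa.alphabet:
            nxt = dfa.delta.get((state, symbol))
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


def equivalent(a, b):
    """Equivalence: search the product automaton for a distinguishing pair."""
    alphabet = a.alphabet | b.alphabet
    start = (a.start, b.start)
    seen, queue = {start}, deque([start])
    while queue:
        p, q = queue.popleft()
        if (p in a.accepting) != (q in b.accepting):
            return False
        for symbol in alphabet:
            nxt = (a.delta.get((p, symbol)), b.delta.get((q, symbol)))
            if nxt not in seen:           # None acts as a shared dead state
                seen.add(nxt)
                queue.append(nxt)
    return True


# Binary strings ending in 1, once with two states and once with three.
ends_in_1 = DFA("01", {("q0", "0"): "q0", ("q0", "1"): "q1",
                       ("q1", "0"): "q0", ("q1", "1"): "q1"}, "q0", {"q1"})
redundant = DFA("01", {("s0", "0"): "s0", ("s0", "1"): "s1",
                       ("s1", "0"): "s2", ("s1", "1"): "s1",
                       ("s2", "0"): "s2", ("s2", "1"): "s1"}, "s0", {"s1"})
ends_in_0 = DFA("01", ends_in_1.delta, "q0", {"q0"})
```

The product search visits at most one pair per combination of states, which is why equivalence stays polynomial for DFA.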

The representational power of the Chomsky Hierarchy is Type-3 $\subset$ Type-2 $\subset$
Type-1 $\subset$ Type-0. While Type-0, Type-1, and Type-2 based models have higher
representational power, they are more challenging to evaluate for performance
characteristics. There are no efficient means known for generating parsers for Type-0
or Type-1 languages. Type-2 grammars can be parsed using Earley's algorithm
in $O(n^3)$ time for ambiguous grammars, $O(n^2)$ for unambiguous grammars, and $O(n)$ for
subcategories [71]. Subcategories of Type-2 grammars that are extensively studied
include k-symbol look-ahead parsers such as the bottom-up $LR(k)$ parsers and the top-down,
recursive descent and $LL(k)$ parsers, where the expressive power is $LL(k) \subset LR(k) \subset$
Type-2 [224, Vol 1, Chap 3, Sec 6.8].

Real data, such as network traces, often directly correspond to more complex
languages. Application level protocols, like HTTP [80] and SMTP [113], are frequently
specified in Backus-Naur Form (BNF)^2 or Augmented BNF [57].

Even so, it is possible to approximate a protocol's language with simpler formal
languages. Context-free languages can be approximated by algorithmically generated
DFA [174,181,182]. DFA have low representational power but can be analyzed for
performance characteristics within combinatorial limits.

1 For example, the equivalence problem for Type-3 finite-state transducers is undecidable.

2 Backus normal form or Backus-Naur form [130] is a widely used alternative representation for
context-free grammars.
Figure 3.1: Chomsky Hierarchy [184].


Table 3.1: Chomsky Hierarchy [159].

| Type | Languages | Automaton | Production rules |
|---|---|---|---|
| Type-0 | Recursively enumerable | Turing machine (unrestricted or phrase structure) | $\alpha \rightarrow \beta$ ($\alpha, \beta \in (V \cup \Sigma)^*$, $\alpha$ contains a variable) |
| Type-1 | Context-sensitive | Linear-bounded non-deterministic Turing machine | $\alpha \rightarrow \beta$ ($\alpha, \beta \in (V \cup \Sigma)^*$, $\lvert\beta\rvert \geq \lvert\alpha\rvert$, $\alpha$ contains a variable) |
| Type-2 | Context-free | Non-deterministic pushdown automaton | $A \rightarrow \alpha$ ($A \in V$, $\alpha \in (V \cup \Sigma)^*$) |
| Type-3 | Regular | Finite state automaton | $A \rightarrow aB$, $A \rightarrow a$ ($A, B \in V$, $a \in \Sigma$) |
Table 3.2: SMTP Sender Transitions. Rows are operators ($\Sigma$) and columns are states ($Q$); each cell gives the resulting state and - marks an undefined transition.

| Operator | INITIAL ($q_0$) | CONN_ESTABLISHED | TRANSACTION_STARTED | DATA_TRANSFER |
|---|---|---|---|---|
| DATA | - | - | DATA_TRANSFER | - |
| HELO | - | CONN_ESTABLISHED | TRANSACTION_STARTED | - |
| MAIL | - | TRANSACTION_STARTED | - | - |
| NOOP | - | CONN_ESTABLISHED | TRANSACTION_STARTED | - |
| QUIT | - | INITIAL | INITIAL | - |
| RCPT | - | - | TRANSACTION_STARTED | - |
| RSET | - | - | CONN_ESTABLISHED | - |
| end-of-data* | - | - | - | CONN_ESTABLISHED |
| more-data* | - | - | - | DATA_TRANSFER |
| open* | CONN_ESTABLISHED | - | - | - |


Protocol state can be represented by grammars, such as context-free or finite
state automata, drawn from the Chomsky Hierarchy. If we select Type-3, in effect,
each endpoint of the communication is a tuple $A = (Q, \Sigma, \delta, q_0, F)$. $Q$ is a finite set
of states, $q_0 \in Q$ is the initial state, and $F \subseteq Q$ is the set of final states.
$\Sigma$ is a finite set of symbols used to label the transitions, known as the alphabet;
here $\Sigma$ is the finite set of verbs or operators in the protocol vocabulary that lead to state
transitions. Protocol operators must follow the syntactical format of the protocol.
The results of protocol operators are the transitions represented by $\delta$, the
partial mapping from $Q \times \Sigma$ into $Q$.


For a simple finite automaton representation of a protocol we define the two
endpoints of a protocol as the sender, represented by the tuple $S = (Q_s, \Sigma, \delta_s, q_{s0}, F_s)$, and
the receiver, represented by $R = (Q_r, \Sigma, \delta_r, q_{r0}, F_r)$. Note that $S$ and $R$ share the same set
of operators $\Sigma$. The structure of a distributed system's communication is determined
by the syntactic format of the protocol operators that make up the finite set $\Sigma$.


As an example, Figure 3.2 shows the states and operators for the sender of the
Simple Mail Transfer Protocol (SMTP). The figure depicts four states, the operators,
and the resulting state transitions.

Mapping Figure 3.2 to a DFA results in the following:


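$Q = \{\text{INITIAL}, \text{DATA\_TRANSFER}, \text{CONN\_ESTABLISHED}, \text{TRANSACTION\_STARTED}\}$

$\Sigma = \{\text{HELO}, \text{NOOP}, \text{QUIT}, \text{MAIL}, \text{RSET}, \text{END-OF-DATA}, \text{RCPT}\}$

$q_0 = \{\text{INITIAL}\}$

$F = \{\text{INITIAL}\}$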


The transition function represented by $\delta$ is shown in Table 3.2.

Figure 3.2: State Diagram for SMTP Sender.

It must be noted that the operators marked with * in Table 3.2 indicate
transitions that must be inferred from the TCP transport service level even though they
are shown in Figure 3.2. Also, the DFA representation does not have to account for
issues such as connection loss or timeout, relegating these issues to the underlying
TCP transport layer.
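The SMTP sender automaton can be written down directly as a transition table. The sketch below transcribes the transitions of Table 3.2 into a dictionary; the starred operators are treated as ordinary symbols, and the names `DELTA` and `accepts` are illustrative assumptions rather than part of the original design.

```python
START = "INITIAL"
FINAL = {"INITIAL"}

# (state, operator) -> next state, transcribed from Table 3.2.
DELTA = {
    ("INITIAL", "open"): "CONN_ESTABLISHED",
    ("CONN_ESTABLISHED", "HELO"): "CONN_ESTABLISHED",
    ("CONN_ESTABLISHED", "MAIL"): "TRANSACTION_STARTED",
    ("CONN_ESTABLISHED", "NOOP"): "CONN_ESTABLISHED",
    ("CONN_ESTABLISHED", "QUIT"): "INITIAL",
    ("TRANSACTION_STARTED", "DATA"): "DATA_TRANSFER",
    ("TRANSACTION_STARTED", "HELO"): "TRANSACTION_STARTED",
    ("TRANSACTION_STARTED", "NOOP"): "TRANSACTION_STARTED",
    ("TRANSACTION_STARTED", "QUIT"): "INITIAL",
    ("TRANSACTION_STARTED", "RCPT"): "TRANSACTION_STARTED",
    ("TRANSACTION_STARTED", "RSET"): "CONN_ESTABLISHED",
    ("DATA_TRANSFER", "end-of-data"): "CONN_ESTABLISHED",
    ("DATA_TRANSFER", "more-data"): "DATA_TRANSFER",
}


def accepts(session):
    """Return True if the operator sequence drives the DFA back to a final state."""
    state = START
    for operator in session:
        state = DELTA.get((state, operator))
        if state is None:                 # undefined transition: reject
            return False
    return state in FINAL


good = ["open", "HELO", "MAIL", "RCPT", "DATA",
        "more-data", "end-of-data", "QUIT"]
bad = ["open", "MAIL", "DATA", "QUIT"]    # QUIT is undefined in DATA_TRANSFER
```

Because the transition function is partial, any operator that is not listed for the current state causes the session to be rejected.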


3.3.2 Other Formal Representations. Alternative frameworks for language
representation include pattern languages [8] [224, Chapter 6] and categorial grammars
[124]. Various other constructs have been proposed that parallel and cross-cut the
Chomsky hierarchy. Two examples are Petri nets and planar languages.


3.3.2.1 Petri nets. Petri nets are another mechanism for representing
communicating systems that have an analogy to the Chomsky hierarchy. Petri nets have
high representational power, but formal performance analysis can be more difficult.
Petri net variations were used by van der Aalst [266] for workflow mining and
discovering social networks [265]. Mayo [162] proposes a hill-climbing technique to learn
Petri net models of gene interactions.

3.3.2.2 Planar languages. Clark presents grammatical inference of
planar languages using string kernel methods in [45,46]. Planar languages cut across
the Chomsky Hierarchy for simple-subsequence kernels [46]. The method was able to
learn some context-sensitive and mildly context-sensitive languages. Artificial data
sets and Matlab® source code are available at the Grammatical Inference with String
Kernels project web site [43].


3.4  Learning  Automata  Representation  of  a  Language 

The process of learning an automaton can be expressed as a decision problem:
given an integer $n$ and two disjoint sets of words $S_+$ and $S_-$ over a finite alphabet $\Sigma$,
does there exist a DFA consistent with $S_+$ and $S_-$ with a number of states less than
or equal to $n$? The learning process is defined formally in Definition 3.4.1.


Definition 3.4.1 (Grammar Induction [41]). A general definition of grammar
induction is: given sets of labeled example strings $S_+$ and $S_-$ such that $S_+ \subseteq L(G)$ and
$S_- \subseteq L'(G)$, infer a DFA $A$ such that the language of $A$, denoted $L(A)$, satisfies
$L(A) = L(G)$, where $L(G)$ is a language generated from an unknown Type-3 grammar $G$.
Its complement, $L'(G)$, is defined as $L'(G) = \Sigma^* - L(G)$, where $\Sigma^*$ is the set of
all strings over the alphabet $\Sigma$ of $L(G)$.
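A common starting point for this kind of inference is the prefix tree acceptor, the tree-shaped DFA that accepts exactly the positive sample. The sketch below builds one and checks consistency with a labeled sample; it is an illustration only, and the toy sample and function names are assumptions.

```python
def build_pta(positive):
    """Prefix tree acceptor: one state per distinct prefix of the positive sample."""
    delta, accepting = {}, set()
    state_count = 1                       # state 0 is the root (empty prefix)
    for word in positive:
        state = 0
        for symbol in word:
            if (state, symbol) not in delta:
                delta[(state, symbol)] = state_count
                state_count += 1
            state = delta[(state, symbol)]
        accepting.add(state)
    return delta, accepting, state_count


def accepts(delta, accepting, word):
    """Run the word from the root; undefined transitions reject."""
    state = 0
    for symbol in word:
        state = delta.get((state, symbol))
        if state is None:
            return False
    return state in accepting


def consistent(delta, accepting, positive, negative):
    """A DFA is consistent if it accepts all of S+ and none of S-."""
    return (all(accepts(delta, accepting, w) for w in positive)
            and not any(accepts(delta, accepting, w) for w in negative))


# Toy labeled sample over protocol operators (hypothetical sessions).
s_plus = [("HELO", "QUIT"), ("HELO", "MAIL", "QUIT")]
s_minus = [("MAIL",), ("HELO",)]

delta, accepting, n_states = build_pta(s_plus)
```

The prefix tree acceptor answers the decision problem trivially for any $n$ at least as large as its state count; finding a consistent DFA with fewer states requires merging states, which is where the difficulty lies.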

Learning  automata  representation  of  languages  by  grammar  induction  is  a  widely 
researched  topic.  Miclet  provides  an  introduction  in  [36,  Chapter  9].  A  survey  to  1994 
is  presented  by  Vidal  in  [270].  A  contemporary  (2005)  bibliographic  survey  of  the  field 
is  presented  by  de  la  Higuera  in  [61]. 


3.5  Computational  Learnability  Models 

There are three major formal models established in the computational learning
community for learning grammar structure from examples, or grammatical inference:
Gold's identification in the limit, Angluin's query learning model, and Valiant's
probably approximately correct (PAC) learning model.

Figure 3.3: Gold's Enumeration Procedure. The decision labeled Consistent determines
if the current grammar is consistent with the sample presented so far. Once the learner
enters loop A it converges to the target language in the limit [201, Figure 1].


3.5.1 Identification in the Limit. Gold [95] proposed his identification in
the limit model in the late 1960s. Gold's model describes a learning process where
an infinite sequence of examples from a grammar $G$ is presented to the inference
algorithm $M$. Figure 3.3 outlines Gold's enumeration procedure. The decision labeled
Consistent determines if the current grammar is consistent with the sample presented
so far. Once the learner enters the loop, denoted A, it converges to the target language
in the limit. The eventual limiting behavior of the algorithm is used as the criterion
of its success. Gold also shows that there is no general method of language inference
from positive samples that will do better than enumeration [95]. Although Gold
establishes the theoretical tractability of grammar induction from positive samples, he
does not provide algorithmic methods other than exhaustive enumeration.
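The enumeration procedure can be sketched with a toy hypothesis space. In this illustration, which is not Gold's construction, the k-th hypothesis is the language of strings whose length is a multiple of k, examples arrive with labels, and the target is assumed to appear somewhere in the enumeration (otherwise the loop would not terminate).

```python
def hypothesis(k):
    """k-th language in the enumeration: strings whose length is a multiple of k."""
    return lambda word: len(word) % k == 0


def identify_by_enumeration(presentation):
    """Yield the index of the current guess after each labeled example."""
    seen = []
    k = 1
    for word, label in presentation:
        seen.append((word, label))
        # Keep the current guess while it is consistent with everything
        # seen so far; otherwise advance to the next consistent hypothesis.
        while not all(hypothesis(k)(w) == lab for w, lab in seen):
            k += 1
        yield k


# Labeled examples of the target "length is a multiple of 3".
examples = [("", True), ("a", False), ("aa", False),
            ("aaa", True), ("aaaa", False), ("aaaaaa", True)]
guesses = list(identify_by_enumeration(examples))
```

After the third example the guess stops changing, which is the limiting behavior the model uses as its success criterion. With positive examples only, the first hypothesis (all strings) would never be contradicted, which hints at why positive-only presentations are problematic.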
3.5.2 Query Learning Model. Angluin developed the L* algorithm to learn
regular languages based on queries and counterexamples [10]. The inference algorithm
is assumed to have access to an expert teacher, like an oracle. The teacher can answer
specific queries, membership and equivalence, asked about an unknown grammar $G$.
The teacher answers a membership query for an input string $w \in \Sigma^*$ with an output
of "yes" if $w$ is generated by $G$ and "no" otherwise. An equivalence query takes an
input grammar $G'$ and the output is "yes" if $G'$ generates the same language as $G$
and "no" otherwise. In the case where the answer is "no", a string $w$ in the symmetric
difference of the language $L(G)$ generated by $G$ and the language $L(G')$ generated by
$G'$ is returned. The returned $w$ is a counterexample.

In  the  inference  from  a  protocol  trace  the  equivalence  test  can  at  best  be  ap¬ 
proximated  while  membership  queries  can  be  answered  by  testing  the  protocol  under 
inspection  [101,247]. 
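One way to picture this is a teacher whose membership queries are answered by the system under inspection and whose equivalence queries are approximated by random testing. The sketch below is an assumption-laden illustration, not the method of [101] or [247]: the class name, the sampling parameters, and the toy protocol are all hypothetical.

```python
import random


class ApproximateTeacher:
    """Teacher backed by a black-box system instead of a true oracle."""

    def __init__(self, system_accepts, alphabet,
                 samples=500, max_len=8, seed=0):
        self.system_accepts = system_accepts   # callable: word -> bool
        self.alphabet = tuple(alphabet)
        self.samples = samples
        self.max_len = max_len
        self.rng = random.Random(seed)

    def membership(self, word):
        """Membership query: test the word against the system itself."""
        return self.system_accepts(word)

    def equivalence(self, hypothesis_accepts):
        """Approximate equivalence query by sampling random words.

        Returns a counterexample from the symmetric difference, or None
        if no disagreement was found (which is not a proof of equivalence).
        """
        for _ in range(self.samples):
            length = self.rng.randint(0, self.max_len)
            word = tuple(self.rng.choice(self.alphabet) for _ in range(length))
            if hypothesis_accepts(word) != self.membership(word):
                return word
        return None


# Toy system: sessions must start with HELO and end with QUIT.
def target(word):
    return len(word) >= 2 and word[0] == "HELO" and word[-1] == "QUIT"


def wrong_hypothesis(word):
    return len(word) >= 1 and word[-1] == "QUIT"


teacher = ApproximateTeacher(target, ("HELO", "MAIL", "QUIT"))
counterexample = teacher.equivalence(wrong_hypothesis)
no_counterexample = teacher.equivalence(target)
```

The sample count and maximum word length trade testing effort against the chance of missing a disagreement, so a `None` result should be read as "not refuted" rather than "equivalent".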

Angluin's query learning method is extensively studied. The tradeoff between equivalence
and membership queries is discussed by Balcazar et al [14]. A proof technique for
demonstrating the hardness of learning by queries regardless of representation is
established by Aizenstein, Hegedus, Hellerstein and Pitt in [4]. Raffelt [211] further
extends the L* algorithm to deal with Mealy machines. The problem of identifying
a value $\epsilon > 0$, where $1 - \epsilon$ is the probability that the oracle answers correctly (or
if already asked, consistently) is left as an open question in the field of grammatical
inference [62].

3.5.3 PAC Learning Model. Valiant [263] introduced probably approximately
correct (PAC) learning, which is a distribution-independent probabilistic model of
learning from random examples. The inference algorithm takes a sample as input
and produces a grammar as output. A successful inference algorithm is one that with
high probability (at least $1 - \delta$) finds a grammar whose error is small (less than
$\epsilon$). Haussler provides an introduction to PAC learning in [102]. Furthermore, Angluin
has shown that an equivalence query algorithm can be translated into a PAC learning
model [11, Section 2.4]. In fact, DFA are not PAC learnable unless we are allowed to
ask membership queries on an oracle [11].

3.6 Tractability

Gold [95] and Angluin [7] proved that, when using passive learning, the problem of
finding the smallest automaton consistent with a set of accepted and rejected strings is
NP-complete (NPC). Gold’s Theorem states that inference on all regular languages is
impossible with only positive samples. Gold further showed that exact identification from
sparsely labeled samples is NPC. The difficulty of the problem was further established by
Pitt and Warmuth [203,205] as well as Pitt and Valiant [204]. While this does not prevent
a solution, it does mean the solution will likely require approximation or heuristic
techniques. In fact, Lang [137] demonstrated experimental evidence that the average
case is tractable and Freund et al. [88] proved the average case is polynomial.

3.7 Search Approaches for Grammar Induction

Given the tractability of grammar induction from positive samples regardless of
representation, Vidal proposes three classes of search for grammar induction: methods
that use additional information, characterizable methods, and heuristic methods [271].
We re-define Vidal’s first class as extrinsic methods. For extrinsic methods,
the additional information extrinsic to the target model includes negative samples,
equivalence queries, or probabilistic information. Characterizable methods concentrate
on subclasses of regular languages that are shown to be learnable from positive
samples. Finally, heuristic methods make direct use of a priori intrinsic
knowledge of the target model, such as the syntactic constraints of the language.

Muggleton also discusses the use of additional information (i.e., negative samples,
limiting the number of states in target automata, or assigning statistical values
to rank target automata) [177, p.121]. Muggleton proposes the use of semantic
information, similar to what we term intrinsic information, from the positive
samples [177, Section 6.7].

The selection of a search approach is driven by what is known about
the language under consideration. A composition of the three approaches might be
appropriate if we have more than one type of knowledge (extrinsic, characterizable,
or intrinsic) available.

3.8  Notations  and  Definitions 

Before we discuss specific algorithms we present the formal notation and
definitions that will be used.³ The definitions provided are derived from [9], [177, Appendix
B], [271], [131], [37], [55], [41], and [62]. In general we favor the format used by [9]
and [55]. Formally defining the mathematical symbology allows us to discuss the
selected algorithm domain in a more compact form.

Definition 3.8.1 (Strings [62]). A string $w$ over $\Sigma$ is a finite sequence
$w = a_1 a_2 a_3 \ldots a_n$ of letters. Let $|w|$ denote the length of $w$. Letters of
$\Sigma$ will be indicated by $a, b, c, \ldots$, strings over $\Sigma$ by $u, v, \ldots, z$,
and the empty string by $\lambda$. Let $\Sigma^*$ be the set of all finite
strings over alphabet $\Sigma$.

Definition 3.8.2 (Languages [62]). A language $L$ is any set of strings, so
$L \subseteq \Sigma^*$. Operations over languages include: set operations (union, intersection,
complement); product $L_1 \cdot L_2 = \{uv : u \in L_1, v \in L_2\}$; power $L^0 = \{\lambda\}$,
$L^{n+1} = L^n \cdot L$; and star $L^* = \bigcup_{i \in \mathbb{N}} L^i$. We denote by
$\mathcal{L}$ or A a class of languages.

Algebraic  laws  for  languages  are  discussed  in  texts  on  formal  languages  (e.g. 
[3,110,159,224]). 

Definition 3.8.3 (Learning Sample of a Language [55]). A learning sample $S$ of a
language $L$ is a finite multi-set of words from $L$. That is, $\forall w \in S, w \in L$.

Definition 3.8.4 (Finite State Automaton). A finite state automaton (FSA) $A$ is
a quintuple $A = (Q, \Sigma, \delta, I, F)$, where $Q$ is a finite set of states, $\Sigma$ is a
finite set of symbols to label the transitions, known as the alphabet, $\delta$ is a partial
mapping from $Q \times \Sigma \to 2^Q$, $I \subseteq Q$ is the set of start states, and
$F \subseteq Q$ is the set of final states.

³ The reader is referred to texts on formal languages (e.g. [3,110,159,224]) if they require
background detail.

The size of the automaton is defined as the total number of states in $Q$, denoted
by $|Q|$; that is, $|A| = |Q|$.

$\mathrm{Pref}_A(q)$ will denote the prefix language of a state $q$, defined by
$\mathrm{Pref}_A(q) = \{w \in \Sigma^* \mid q \in \delta(q_0, w)\}$ [55, Definition 1].

$\mathrm{Suff}_A(q)$ will denote the suffix language of a state $q$, defined by
$\mathrm{Suff}_A(q) = \{w \in \Sigma^* \mid \delta(q, w) \cap F \neq \emptyset\}$ [55, Definition 1].

The FSA is minimized if no pair of states is equivalent. That is, given $q_i, q_j \in Q$,
$i \neq j$, there is an input word $x$ that distinguishes them such that
$\delta(q_i, x) \neq \delta(q_j, x)$.

Definition 3.8.5 (Deterministic Finite State Automaton). An FSA is deterministic,
that is, a deterministic finite state automaton (DFA), if $\forall q \in Q, \forall a \in \Sigma$,
$\delta(q, a)$ has at most one element; otherwise the FSA is non-deterministic.
Additionally, $|I| = 1$, with the start state denoted by $q_0$, where $q_0$ is the single
element of $I$.

Definition 3.8.6 (Acceptance [55]). An acceptance for a word $w \in \Sigma^*$, where
$w = a_1 a_2 a_3 \ldots a_{|w|}$, in an automaton $A = (Q, \Sigma, \delta, I, F)$ is a sequence
$(q_0, \ldots, q_{|w|})$ of $|w| + 1$ states such that $q_0 \in I$, $\forall i \in [1, |w|]$,
$q_i \in \delta(q_{i-1}, a_i)$, and $q_{|w|} \in F$. $q_0$ is said to be the initial state and
$q_{|w|}$ is said to be the final state of the acceptance. Transitions
$(q_{i-1}, a_i, q_i)$ are said to be reached by the acceptance. The set of acceptances of a
word $w$ in automaton $A$ is denoted by $\mathrm{Acc}_A(w)$.

Definition 3.8.7 (Regular Grammar). A regular grammar is a grammar in which
all of the productions are of the form $A \to aB$ or $A \to \lambda$. Given a finite automaton
$A = (Q, \Sigma, \delta, q_0, F)$, it is possible to construct a regular grammar $G = (N, T, P, S)$
such that $L(A) = L(G)$ as follows. For each state in $Q$ add a non-terminal of the
same name to $N$. For each symbol in $\Sigma$ add an equivalent symbol to $T$. Let $S$ be the
non-terminal named $q_0$. For each $q \in F$ add a production to $P$ of the form $q \to \lambda$. For
each transition of the form $\delta(q_1, a) = q_2$ add a production to $P$ of the form $q_1 \to a q_2$.
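The construction in Definition 3.8.7 can be sketched in a few lines. The function name, the encoding of a production as a `(left, right)` pair, and the use of an empty tuple for the empty string are assumptions made for illustration only.

```python
def automaton_to_grammar(states, alphabet, delta, start, finals):
    """Build a regular grammar (N, T, P, S) from a deterministic automaton.

    delta maps (state, symbol) -> next state. A production is a pair
    (left, right) where right is () for the empty string or (a, B).
    """
    nonterminals = set(states)   # one non-terminal per state
    terminals = set(alphabet)    # one terminal per input symbol
    productions = []
    for q in sorted(finals):
        productions.append((q, ()))          # q -> lambda for each final state
    for (q1, a), q2 in sorted(delta.items()):
        productions.append((q1, (a, q2)))    # q1 -> a q2 for each transition
    return nonterminals, terminals, productions, start
```

For a two-state automaton with a single transition from `q0` to the final state `q1` on `a`, the result contains the productions `q1 -> lambda` and `q0 -> a q1`, which together generate exactly the string `a`.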

Definition 3.8.8 (Regular Languages). The transition function $\delta$ of a DFA can be
extended to $\Sigma^*$: $\delta(q, \lambda) = q$ and $\delta(q, a \cdot w) = \delta(\delta(q, a), w)$
for all $q \in Q$, $a \in \Sigma$, $w \in \Sigma^*$. Let $L(A)$ denote the language recognized by
the automaton $A$: $L(A) = \{w \in \Sigma^* \mid \delta(q_0, w) \in F\}$. By definition the
language $L(A)$ accepted by a DFA $A$ is a regular language. That is, the class of regular
languages can be defined as the class of languages accepted by finite automata.
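The recursive extension of the transition function translates directly into code. This is a sketch under the assumption that a DFA's transitions are stored in a dictionary keyed by `(state, symbol)`; the names `delta_star` and `accepts` are illustrative.

```python
def delta_star(delta, q, w):
    """Extended transition function: delta(q, lambda) = q and
    delta(q, a.w) = delta(delta(q, a), w). Returns None if undefined."""
    if not w:
        return q
    nxt = delta.get((q, w[0]))
    if nxt is None:  # partial transition function: no move on this symbol
        return None
    return delta_star(delta, nxt, w[1:])


def accepts(delta, q0, finals, w):
    """w is in L(A) iff the extended transition from q0 ends in a final state."""
    return delta_star(delta, q0, w) in finals
```

The recursion mirrors the definition one symbol at a time; an iterative loop would be preferable for very long words, since Python limits recursion depth.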

Definition 3.8.9 (Type-3 Grammar [41]). A Type-3 Grammar is a four-tuple
$G = (T, N, P, S)$ where:

• $T \subset N$ is a finite non-empty set called the terminal alphabet of $G$,

• $N$ is a finite non-empty set called the total vocabulary of $G$,

• $P$ is a finite set of production rules,

• $S \in (N - T)$ is referred to as the start state, and the rules in $P$ are of the form
$A \to aB$ or $A \to a$, where $A, B \in (N - T)$ and $a \in T$.

Definition 3.8.10 (Canonical Automaton [55]). A Canonical Automaton (CA) of a
language $L$, denoted by $CA(L)$, is the sole minimal DFA accepting $L$.

Definition 3.8.11 (Universal Automaton [55]). A Universal Automaton (UA) of a
language $L$, denoted by $UA(L)$, is the canonical automaton $A(\Sigma^*)$ accepting all words
$w \in \Sigma^*$.

Definition 3.8.12 (Maximal Canonical Automaton [55]). A Maximal Canonical
Automaton (MCA) related to a learning sample $S$, denoted by $MCA(S)$ or more simply
MCA, is the union, over each word $w$ of the learning sample $S$, of the canonical automata
$A(\{w\})$. An MCA is a star-like automaton that exactly recognizes the data from sample
$S$.

Definition 3.8.13 (Partition of set S [177]). A partition of set $S$, denoted by $\pi_S$, is
a set of pairwise disjoint non-empty subsets of $S$ such that the union of all blocks of
$\pi_S$ is equal to $S$.

Definition 3.8.14 (Block [177]). The unique block of $\pi_S$ containing $s$, where $s \in S$,
is denoted $B(s, \pi_S)$.

Definition 3.8.15 (Refines [177]). Given two partitions $\pi$ and $\pi'$, $\pi$ refines $\pi'$ iff
every block of $\pi'$ is a union of blocks of $\pi$.
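Definitions 3.8.13 through 3.8.15 can be made concrete with a small sketch. Representing a partition as a list of Python sets is an assumption made for illustration, and `refines` additionally assumes that both arguments are partitions of the same set, in which case "every block of $\pi'$ is a union of blocks of $\pi$" is equivalent to "every block of $\pi$ lies inside some block of $\pi'$".

```python
def block(s, partition):
    """B(s, pi): the unique block of the partition that contains s."""
    for b in partition:
        if s in b:
            return b
    raise ValueError(f"{s!r} is not covered by the partition")


def refines(pi, pi_prime):
    """True iff pi refines pi_prime (both must partition the same set)."""
    return all(any(b <= bp for bp in pi_prime) for b in pi)
```

State merging algorithms move from a fine partition of the states toward coarser ones, so each intermediate partition is refined by the one before it.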


Definition  3.8.16  (Prefix  Tree  Acceptor  [55]).  A  Prefix  Tree  Acceptor  (PTA)  on  a 
sample  S,  PTA(S)  or  PTA ,  is  the  deterministic  automaton  obtained  when  merging 
every  state  of  MCA(S)  with  identical  prefix  languages. 


A PTA is an FSA that can be constructed by laying out the strings of a finite language,
using one state to represent each unique prefix of one of the strings. The language
accepted by a PTA is regular, and the PTA accepts exactly the strings of that language.


Figure 3.4 shows the PTA for POP3 client commands sent to servers from week
1 day 1 inside IDEVAL traffic.
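A PTA can be built directly from the description above by using the prefixes themselves as state names. The sketch below is illustrative: the function name and the tuple encoding of states are assumptions, and the command sequences in the usage note are made up rather than taken from the IDEVAL data.

```python
def build_pta(sample):
    """Prefix tree acceptor: one state per distinct prefix of the sample.

    Each word is a sequence of symbols; states are tuples of symbols.
    Returns (states, delta, start, finals).
    """
    start = ()
    states, finals, delta = {start}, set(), {}
    for word in sample:
        state = start
        for symbol in word:
            nxt = state + (symbol,)        # extend the current prefix
            delta[(state, symbol)] = nxt
            states.add(nxt)
            state = nxt
        finals.add(state)                  # the whole word is accepted
    return states, delta, start, finals
```

For the two hypothetical sessions `("USER", "PASS", "QUIT")` and `("USER", "PASS", "STAT", "QUIT")`, the shared prefix `USER PASS` is represented once, giving six states and two final states.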


Definition 3.8.17 (Augmented Prefix Tree Acceptor [41]). An Augmented Prefix Tree
Acceptor (APTA) is a six-tuple $G = (Q, \Sigma, \delta, s, F_+, F_-)$ where:

• $Q$ is a finite non-empty set of nodes,

• $\Sigma$ is a finite non-empty set of input symbols or input alphabet,

• $\delta : Q \times \Sigma \to Q$ is the transition function,

• $s \in Q$ is the start or root node,

• $F_+ \subseteq Q$ identifies final nodes of strings in $S_+$,

• $F_- \subseteq Q$ identifies final nodes of strings in $S_-$.

The size of an APTA is defined as the total number of elements in $Q$, denoted $|Q|$.

Definition 3.8.18 (APTA node equivalence [41]). Two nodes $q_i$ and $q_j$ in an APTA
are considered not equivalent if and only if:

• $(q_i \in F_+ \wedge q_j \in F_-) \vee (q_i \in F_- \wedge q_j \in F_+)$, or

• $\exists s \in \Sigma$ such that if $(q_i, s, q_{i'}) \in \delta$ and $(q_j, s, q_{j'}) \in \delta$
then $q_{i'}$ is not equivalent to $q_{j'}$.

Figure 3.4: Example PTA - Command PTA for POP3 from
Week 1 Day 1 inside IDEVAL traffic.
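The recursive test in Definition 3.8.18 can be sketched as follows. The function name and the dictionary encoding of $\delta$ are assumptions for illustration. The recursion terminates because an APTA is a tree, so every recursive call moves one level deeper.

```python
def not_equivalent(qi, qj, delta, f_plus, f_minus, alphabet):
    """True iff nodes qi and qj of an APTA are distinguishable.

    delta maps (node, symbol) -> child node; f_plus and f_minus are the
    sets of nodes that end positive and negative sample strings.
    """
    # Case 1: one node is labeled accepting and the other rejecting.
    if (qi in f_plus and qj in f_minus) or (qi in f_minus and qj in f_plus):
        return True
    # Case 2: some shared symbol leads to a pair of distinguishable children.
    for s in alphabet:
        ni, nj = delta.get((qi, s)), delta.get((qj, s))
        if ni is not None and nj is not None:
            if not_equivalent(ni, nj, delta, f_plus, f_minus, alphabet):
                return True
    return False
```

Two nodes for which this test returns `False` are candidates for merging; unlabeled nodes with no shared outgoing symbol are never distinguished by it.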

Definition 3.8.19 (Strictly locally testable languages [37]). Let $k$ be a positive
integer. For $w \in \Sigma^+$ of length $\geq k$, let $L_k(w)$, $R_k(w)$, and $I_k(w)$ be respectively
the prefix of length $k$, the suffix of length $k$, and the set of interior factors of length $k$
of the word $w$. $L \subseteq \Sigma^*$ is strictly $k$-testable if and only if there exist three sets
$X, Y, Z$ of words on $\Sigma$ such that for all $w \in \Sigma^+$, $|w| \geq k$, $w \in L$ iff
$L_k(w) \in X$, $R_k(w) \in Y$, and $I_k(w) \subseteq Z$. A language is strictly locally testable
if it is strictly $k$-testable for some $k > 0$. We denote the class of strictly locally
testable languages as $s\mathcal{LT}$.

Definition 3.8.20 (Left Quotient [9,131,177]). Let $L$ be any language. $\mathrm{Pref}(L)$
denotes the set of all prefixes of elements of language $L$.

For any $w \in \Sigma^*$, we denote the left-quotient of language $L$ and word $w$ by
$w \backslash L$. That is, $w \backslash L = \{x \in \Sigma^* \mid wx \in L\}$. Angluin introduced
the equivalence relation $\equiv_L$ over $\Sigma^*$ defined as: $w_1 \equiv_L w_2$ iff
$w_1 \backslash L = w_2 \backslash L$. A language $L$ is regular iff the
number of equivalence classes of $\equiv_L$ is finite.

Definition 3.8.21 (k-tails [177]). The $k$-tails of word $w$ in language $L$, denoted by
$w \backslash_k L$, is the set $\{v : v \in w \backslash L, |v| \leq k\}$. That is, the $k$-tail is
the set of all members of the left quotient with a length of no more than $k$.
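For a finite language, such as a sample of observed strings, the left quotient and the $k$-tails can be computed directly. The function names below are illustrative assumptions; words are plain Python strings.

```python
def left_quotient(w, language):
    """Left quotient of a finite language by w: {x : wx in language}."""
    return {x[len(w):] for x in language if x.startswith(w)}


def k_tails(w, language, k):
    """k-tails of w: members of the left quotient of length at most k."""
    return {v for v in left_quotient(w, language) if len(v) <= k}
```

With the language `{"ab", "abc", "ad", "b"}`, the left quotient by `"a"` is `{"b", "bc", "d"}` and its 1-tails are `{"b", "d"}`.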

Definition 3.8.22 (k-reversible Languages [9, 131]). A language $L$ is pseudo
$k$-reversible iff whenever $u_1 v w$ and $u_2 v w$ are in $L$ and $|v| = k$,
$u_1 v \equiv_L u_2 v$ holds. A language $L$ is $k$-reversible iff $L$ is regular and
pseudo-$k$-reversible. For any non-negative integer $k$, we denote the class of
$k$-reversible languages by $\mathcal{R}ev_k$. It is known that for any non-negative
integer $k$, the class $\mathcal{R}ev_k$ is properly contained in the class
$\mathcal{R}ev_{k+1}$.

3.9  Grammatical  Inference  Algorithms 

Several algorithms exist for inferring grammars for language understanding.
The algorithms have been used in a range of GI tasks including machine learning,
formal language theory, structural recognition, natural language processing, and speech
recognition [61].

Table 3.3: Regular Inference Algorithms.

Algorithm                  Negative Samples Required   Target Automaton
-------------------------  --------------------------  ---------------------------------------
Trakhtenbrot and Barzdin   Yes                         DFA
ECGI                       No                          DFA, PFSA [269, Section 2.2]
k-TSSI                     No                          k-testable DFA
k-RI Angluin               Optional                    k-reversible DFA
k-RI Muggleton             Optional                    uniquely τ-terminated k-reversible DFA
MGGI                       No                          DFA
RIG and BRIG               Yes                         DFA
RPNI                       Yes                         DFA
EDSM                       Yes                         DFA


Algorithms can be classified according to their target language/grammar: Type-3,
regular; Type-2, context-free; Type-1, context-sensitive; or Type-0, phrase
structure (see Figure 3.1). Algorithms can also be characterized according to the classes
discussed in Section 3.7, that is: extrinsic, characterizable, or heuristic. The main
types of extrinsic information we are concerned with are negative samples and queries.


3.10 Inference of Regular Languages (Type-3)

Inference of regular languages is well studied. Muggleton provides an
introduction to regular inference in [177, Chap 6]. Gronfors [99] conducts experimental
analysis of several algorithms for generality. Dupont presents an early look at the
search space of regular inference in [70]. Hingston describes various approaches to
develop a family of regular inference algorithms [106]. Coste and Fredouille provide a
discussion of the search space of inference of DFA, NFA, and unambiguous finite
automata in [55]. Coste et al. [54] discuss the importance of domain and typing
background knowledge to tune inference algorithms.

The limitations on inference of the union of multiple languages from intermixed
samples are discussed by [278], which refines Angluin’s necessary and sufficient conditions
for inference. Given that we do not know whether application protocol languages meet
the conditions defined by [278], we will not further consider intermixed samples for
bilingual inference.

Table 3.4: Regular Inference Algorithm Performance.

Algorithm                  Complexity O()                     Notes
-------------------------  ---------------------------------  ---------------------------------------------
Trakhtenbrot and Barzdin   $O(mn^2)$                          where $m$ = |initial APTA|, $n$ = number of
                                                              states in the final hypothesis automaton.
ECGI                       not characterized                  experimental evidence from [227].
k-TSSI                     $O(kn \log n)$                     where $n$ is the sum of the lengths of all
                                                              the strings in $S_+$ [89].
k-RI Angluin               $O((k+1)^2 n^3)$ [9, Theorem 35]   where $n$ is the sum of the lengths of all
                                                              strings in the sample.
k-RI Muggleton             $O(n^2)$ [177, Section 6.5]        where $n$ is the sum of the lengths of all
                                                              strings in the sample.
MGGI                       not characterized
RIG and BRIG               non-polynomial                     experimental evidence from [169, Section 3.5]
RPNI                       $O((m + m')m^2)$                   where $m$ is the sum of the length of all
                                                              strings in $S_+$ and $m'$ is the sum of the
                                                              length of all strings in $S_-$.
EDSM                       not characterized

Table 3.3 summarizes the characteristics of several algorithms used for inference
of regular languages. Table 3.4 summarizes the performance characteristics of several
of the algorithms discussed.


3.10.1 Statistical Extrinsic Methods. Muggleton provides a framework to
represent several statistical extrinsic methods to determine state merges by defining
a generalized algorithm using a predicate function $\chi(u, v)$ [177, Appendix C]. The
generalized algorithm is shown in Algorithm 1, where $\chi(u, v)$ takes the values shown in
Table 3.5 for some $k$-tails algorithms.


Input: $S_+$, a non-empty set of positive sample strings
Output: The acceptor $A_0/\pi_f$ (for $S_+$)

/* Initialization */
/* $A_0$ is a DFA represented by $(Q_0, \Sigma, \delta_0, I_0, F_0)$ */
Let $A_0 = FormPTA(S_+)$;
Let $\pi_0$ be the trivial partition of $Q_0$;
Let $i = 0$;
/* Merging */
for $\forall (u, v) \in Q_0$ do
    if $\chi(u, v)$ then
        Let $B_1 = B(u, \pi_i)$;
        Let $B_2 = B(v, \pi_i)$;
        Let $\pi_{i+1}$ be $\pi_i$ with $B_1$ and $B_2$ merged;
        Increment $i$ by 1;
    end
end
/* Termination */
Let $f = i$;
return The acceptor $A_0/\pi_f$

Algorithm 1: Muggleton Algorithm IM1 [177, p.100]
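The generalized merging loop of Algorithm 1 can be sketched with a union-find structure standing in for the sequence of partitions $\pi_0, \pi_1, \ldots, \pi_f$. This is an illustrative sketch, not Muggleton's implementation: the function names, the use of prefixes as PTA state names, and the Biermann and Feldman style $k$-tails predicate used in the usage note are assumptions.

```python
from itertools import combinations


def k_tails(u, sample, k):
    """k-tails of prefix u with respect to the positive sample."""
    return frozenset(x[len(u):] for x in sample
                     if x.startswith(u) and len(x) - len(u) <= k)


def im1(sample, chi):
    """Merge PTA states u, v whenever chi(u, v) holds.

    Returns (blocks, delta, start, finals) for the quotient automaton;
    delta maps (block representative, symbol) -> set of representatives.
    """
    # PTA states are the prefixes of the sample; pi_0 is the trivial partition.
    states = sorted({w[:i] for w in sample for i in range(len(w) + 1)})
    parent = {q: q for q in states}

    def find(q):
        while parent[q] != q:
            parent[q] = parent[parent[q]]   # path compression
            q = parent[q]
        return q

    for u, v in combinations(states, 2):
        if chi(u, v):
            parent[find(u)] = find(v)       # merge the blocks of u and v

    blocks = {}
    for q in states:
        blocks.setdefault(find(q), set()).add(q)
    delta = {}
    for q in states:
        if q:  # state q is reached from its parent prefix on its last symbol
            delta.setdefault((find(q[:-1]), q[-1]), set()).add(find(q))
    finals = {find(w) for w in sample}
    return list(blocks.values()), delta, find(""), finals
```

Calling `im1({"ab", "abab"}, chi)` with `chi` comparing 1-tails merges `a` with `aba` and `ab` with `abab`, leaving three blocks whose quotient automaton loops on `ab`. The quotient may be non-deterministic in general, which is why `delta` maps to sets.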
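
Table 3.5: Muggleton Predicate Functions $\chi(u, v)$ for $k$-tails [177, Appendix C].

Source: Biermann and Feldman [28]

$$\chi(u, v) = \begin{cases} \text{true} & u \backslash_k S_+ = v \backslash_k S_+ \\ \text{false} & \text{otherwise} \end{cases}$$

Source: Levine [148]

$$\chi(u, v) = \begin{cases} \text{true} & \mathrm{Stren}(u, v) \geq \mathrm{Stren} \\ \text{false} & \text{otherwise} \end{cases}$$

where

$$\mathrm{Stren}(u, v) = \max_i \left[ \frac{2\,|u \backslash_i S_+ \cap v \backslash_i S_+|}{|u \backslash_i S_+| + |v \backslash_i S_+|} \right], \quad i \in \mathbb{N},$$

and $\mathrm{Stren} \in [0, 1] \subset \mathbb{R}$.

Source: Miclet [168]

$$\chi(u, v) = \begin{cases} \text{true} & u \backslash S_+ \cap v \backslash S_+ \neq \emptyset \\ \text{false} & \text{otherwise} \end{cases}$$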

Another statistical extrinsic method, Minimum Description Length (MDL) as
presented by [146], also fits into the Muggleton predicate form but does not use $k$-tails.
Instead

$$\chi(u, v) = \begin{cases} \text{true} & |DFA'| + |DFA'(S_+)| < |DFA| + |DFA(S_+)| \\ \text{false} & \text{otherwise} \end{cases}$$

where $DFA$ is initially the PTA of $S_+$ and $DFA'$ is the hypothesis DFA. $DFA'$
replaces $DFA$ after each successful iteration.

3.10.2 Extrinsic Negative Sample Support Methods. Several algorithms exist
that require extrinsic negative sample support, including Trakhtenbrot and Barzdin’s
algorithm, Miclet’s Regular Inference of Grammars, and Regular Positive and Negative
Inference.

3.10.2.1 Trakhtenbrot and Barzdin. One of the earliest algorithms
was proposed by Trakhtenbrot and Barzdin⁴ [41]. We classify the algorithm as an
extrinsic method because it was designed for completely labeled data sets containing
both positive and negative examples. According to Cicchello [41, Section 10.1.3]:

    The upper bound on the runtime is $mn^2$, where $m$ is the total number of
    nodes in the initial APTA and $n$ is the total number of states in the final
    hypothesis.

Cicchello presents a modification that supports use of the algorithm with incomplete
training sets in [41].

⁴ We were unable to locate a copy of the original presentation by Trakhtenbrot and Barzdin in [256].

3.10.2.2 Regular Inference of Grammars. Regular Inference of Grammars
(RIG) requires extrinsic negative sample support [169]. Miclet also presents
Boosted beam-search Regular Inference of Grammars (BRIG), which also requires
extrinsic negative sample support [169]. Miclet describes both RIG and BRIG as
inefficient [169, Section 3.5] and concludes from experimental results that the algorithms
are non-polynomial [169, Section 5].

3.10.2.3 Regular Positive and Negative Inference. Regular Positive
and Negative Inference (RPNI) requires extrinsic negative sample support. The
algorithm was proposed by Lang [137] and independently by Oncina and Garcia [189].
The RPNI algorithm has been shown to identify regular languages in the limit. The
complexity is a function of the positive and negative sample sizes. RPNI has an
update time of $O((m + m')m^2)$, where $m$ is the sum of the lengths of all positive data
($S_+$) and $m'$ is the sum of the lengths of all negative data ($S_-$).

RPNI  works  by  starting  with  the  PTA,  and  merging  pairs  of  states  if  possible, 
using  a  fixed  depth-first  ordering  of  state  pairs.  The  algorithm  runs  in  polynomial 
time  and  is  guaranteed  to  identify  the  target  FSA  given  complete  sample  data. 

One limiting factor of the RPNI algorithm is that it requires presentation of the
entire positive and negative sample data. If new data becomes available, the inference
process must be restarted. A modification to allow for incremental inference was proposed
by Dupont [69]. The algorithm is modified to support noisy samples and presented
as RPNI* by [237]. More recently, Hoffman provides experimental evidence that
prohibiting some of the merges performed by the original RPNI algorithm improved
performance against artificial random data sets [107].

3.10.2.4 Evidence Driven State Merging. Evidence Driven State Merging
(EDSM) requires extrinsic negative sample support. The basic algorithm is
described by Lang in [138]. EDSM performs merges in arbitrary order such that both
nodes in a merge might be the roots of arbitrary subgraphs of the hypothesis automaton
[138]. To overcome this, a modification to EDSM called the blue-fringe algorithm
restricts the merge order [138] using a policy described by [123]. The algorithm is
further modified to support noisy samples and presented as BLUE* by [237] as part
of the Learning DFA from Noisy Samples competition [153] for the GECCO 2004
conference [262].


3.10.3 Characterizable Methods. Given Gold’s result that regular languages
cannot be inferred in the limit from only positive data [95], the search for
characterizable subclasses that can be inferred with only positive data has become a kind of
“holy grail” of grammatical inference. Many subclasses of regular languages have
been proposed: strictly regular languages [284], $k$-reversible [9], locally testable
languages in the strict sense [89], code regular languages [74], and Szilard languages of
regular grammars [156,282].


Figure 3.5 shows some of the families of languages that are classified within
regular languages. Two well studied characterizable methods are $k$-Testable in the
Strict Sense Inference and $k$-Reversible Inference.


3.10.3.1 k-Testable in the Strict Sense Inference. The inductive inference
of the class of $k$-testable languages in the strict sense, $k$-Testable in the Strict Sense
Inference ($k$-TSSI), was proposed by Garcia and Vidal in 1990 [89]. A language that is
$k$-testable in the strict sense is defined by a finite set of substrings of length $k$ that are
permitted in the target language [89]. This algorithm is a characterizable method that
limits the target model to $k$-testable languages.

Figure 3.5: Some Families of Regular Languages [271]. (Recovered figure labels: Other
Classes of Languages; Context-Free Languages.)

The $k$-testable languages in the strict sense ($k$-TSSL) are a subclass of the $k$-testable
languages. A $k$-TSSL is essentially defined by a finite set of substrings of length $k$ that
are permitted to appear in the strings of the language. Given a positive learning sample
$S_+$ of strings of an unknown language, a deterministic finite-state automaton that
recognizes the smallest $k$-TSSL containing $S_+$ is obtained.

The number of transitions in the inferred automaton is bounded by $O(m)$,
where $m$ is the number of substrings defining the $k$-TSSL, and the inference algorithm
works in $O(kn \log n)$, where $n$ is the sum of the lengths of all the strings in $S_+$ [89,
Theorem 6.1].
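The idea of learning the smallest strictly $k$-testable language containing a sample can be sketched with plain sets, following the prefix, suffix, and interior-factor formulation of Definition 3.8.19. This is not the automaton-building algorithm of [89]; the function names are illustrative, interior factors are assumed to be the length-$k$ factors that are neither the prefix nor the suffix, and sample words shorter than $k$ are simply remembered as they are.

```python
def factors(w, k):
    """Prefix, suffix, and interior factors of length k, for len(w) >= k."""
    interior = {w[i:i + k] for i in range(1, len(w) - k)}
    return w[:k], w[-k:], interior


def learn_k_tss(sample, k):
    """Collect the sets X, Y, Z (and short words) observed in the sample."""
    X, Y, Z, short = set(), set(), set(), set()
    for w in sample:
        if len(w) < k:
            short.add(w)
        else:
            prefix, suffix, interior = factors(w, k)
            X.add(prefix)
            Y.add(suffix)
            Z |= interior
    return X, Y, Z, short


def member(w, model, k):
    """Membership test for the language described by (X, Y, Z, short)."""
    X, Y, Z, short = model
    if len(w) < k:
        return w in short
    prefix, suffix, interior = factors(w, k)
    return prefix in X and suffix in Y and interior <= Z
```

Learning from `["abab", "ababab"]` with `k = 2` yields a model that also accepts `"abababab"`, which shows how the method generalizes beyond the sample, while `"abba"` is rejected because its suffix `"ba"` was never observed.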

Torres and Varona [255] present a low-level representation of $k$-TSS structures
proposed for use in continuous speech recognition. Varona and Torres also conducted
experimental analysis with $k$ values of 4 and 5 on smoothed stochastic FSA for
continuous speech recognition [268].

Experiments by [89, Section VIII] show the ability of (stochastic) $k$-TSSLs to
approach other classes of regular languages. An algorithm for $k$-TSSI is outlined
in [89, Figure 1].


3.10.3.2 k-Reversible Inference. $k$-Reversible Inference ($k$-RI), introduced
by Angluin, does not require negative sample support [9]. Like $k$-TSSI, $k$-RI
is a characterizable method, restricting the target automaton to $k$-reversible regular
languages [9].


The class of $k$-reversible regular languages is a subset of the regular languages
with the properties explained in Definition 3.8.22. The target language must be
$k$-reversible for some $k \geq 0$. The $k$-RI algorithm identifies the minimum $k$-reversible
language containing any finite positive sample in $O((k + 1)^2 n^3)$ time, where $n$ is the
summation of the lengths of the strings in the sample [9, Theorem 35].


This was later reduced by Muggleton to $O(n^2)$ time with the added restriction
that the target automaton is uniquely terminated, that is, the automaton is a uniquely
$\tau$-terminated acceptor [177, Section 6.5]. A uniquely $\tau$-terminated acceptor is an FSA
with the property that any transition arc is labeled with the termination symbol $\tau$ if
it leads to an acceptor state and that the acceptor state has no outgoing arcs [177,
Section 6.5.1].
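To give a feel for reversible inference, the sketch below covers only the $k = 0$ case, using the usual description of zero-reversible inference: start from the PTA, merge all accepting states, then keep merging blocks until the quotient is both deterministic and reverse deterministic. It is a simplified illustration and not the algorithm of [9] or [177]; the function name and the restart-after-each-merge loop are assumptions chosen for clarity over speed.

```python
def zero_reversible(sample):
    """Infer a zero-reversible acceptor from a non-empty list of strings.

    Returns (blocks, transitions), where transitions maps
    (block representative, symbol) -> block representative.
    """
    states = {w[:i] for w in sample for i in range(len(w) + 1)}  # PTA states
    parent = {q: q for q in states}

    def find(q):
        while parent[q] != q:
            q = parent[q]
        return q

    def union(p, q):
        p, q = find(p), find(q)
        if p == q:
            return False
        parent[p] = q
        return True

    words = list(sample)
    for w in words[1:]:
        union(words[0], w)  # a zero-reversible acceptor has one accepting block

    edges = [(q[:-1], q[-1], q) for q in states if q]
    merged = True
    while merged:
        merged = False
        succ, pred = {}, {}
        for p, a, q in edges:
            bp, bq = find(p), find(q)
            # Determinism: a block has at most one a-successor.
            # Reverse determinism: a block has at most one a-predecessor.
            if (union(succ.setdefault((bp, a), bq), bq)
                    or union(pred.setdefault((bq, a), bp), bp)):
                merged = True
                break  # representatives changed, so rescan from scratch
    blocks = {}
    for q in states:
        blocks.setdefault(find(q), set()).add(q)
    transitions = {(find(p), a): find(q) for p, a, q in edges}
    return list(blocks.values()), transitions
```

From the sample `["", "aa", "aaaa"]` the sketch returns two blocks joined by `a`-transitions in both directions, that is, an acceptor for strings with an even number of `a` symbols.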


3.10.4 Heuristic Methods. Heuristic methods leverage intrinsic knowledge of
the target automata. Heuristic methods concentrate on producing a target model that
is useful for a problem domain, without necessarily considering the automaton’s membership
in a language-theoretic characterizable class. Algorithms in this category include:
Morphic Generator Grammatical Inference, the Burge algorithm, Continuous Time
Markov Chain models, and kBehavior.


3.10.4.1 Morphic Generator Grammatical Inference. Morphic Generator
Grammatical Inference (MGGI) does not require extrinsic negative sample
support. The inference procedure was introduced by Garcia as the “Local Language
Inference Algorithm” [90]. Sanchis discusses the use of MGGI for inferring phonetic
units [232]. Vidal outlines the learning approach for MGGI regular inference in [271]
but does not provide an algorithmic implementation.

MGGI works with two finite alphabets, $\Sigma$ and $\Sigma'$, and a set of positive sample
strings $S_+ \subseteq \Sigma^*$ [271]. A function $g$ is used to rename the words in $S_+$, resulting in
$S'_+ = g(S_+)$ where $S'_+ \subseteq \Sigma'^*$ [271]. The corresponding 2-TS language is obtained from
$S'_+$ and another renaming function $h$ is applied [271]. The MGGI inferred language
is $L = h(l(g(S_+)))$, where $l(\cdot)$ denotes the 2-TS language obtained from its argument.
The renaming function is the morphic generator, which allows for
generalization of the language under consideration. The renaming function relies on
extrinsic knowledge of the model under consideration.

We did not discover a formal analysis of the performance characteristics of
MGGI. We did discover empirical results specific to speech recognition
in [232,271]. Local language learning, similar to MGGI, is applied to DNA sequence
analysis by [283].

3.10.4.2 Burge. The Burge algorithm,⁵ which does not require negative
samples, is presented by Ingham in [114,115]. The algorithm is $O(nm)$, where $n$
is the number of samples in the training set and $m$ is the average number of tokens
in a sample [114, Section 3.9]. Ingham presents a modification to the algorithm to
support incremental learning of DFA models from tokenized HTTP requests. While [115]
does generate a notional model of the HTTP request, the focus is on approximation
of HTTP for intrusion detection, not model recovery.

3.10.4.3 Continuous Time Markov Chains. Sen et al. examine the use
of grammatical inference inspired algorithms to learn edge labeled Continuous Time
Markov Chains [238]. Java source code for their implementation is available.

⁵ John Burge is a co-author of [115].

3.10.4.4 kBehavior. The $k$-tails approach is modified by Mariani
and Pezzè and presented as kBehavior, which is an incremental approach designed
for limited storage capacity [158, Section 2]. The technique uses a heuristic to merge
multiple states that are recognized as a common behavior instead of individual states.
This is in part to improve branch and loop detection from execution traces.

3.10.5 Hybrid Methods. There are several randomized and heuristic
approaches to regular language inference that do not neatly fit into our categories of
extrinsic, characterizable, or heuristic. If we have both intrinsic and extrinsic
information, hybrid methods are possible. One such method is Angluin’s L* algorithm;
another is Error Correcting Grammatical Inference.

3.10.5.1 Angluin’s L* Algorithm. Angluin also presents a modification
to the k-RI algorithm using both extrinsic negative samples and queries [9, Section 7].
An overview of Angluin’s L* algorithm is presented by Berg in [23]. Berg discusses
evaluation of L* (as presented by Angluin 1987 [10]) for prefix-closed DFA^6 against
random samples and real world examples drawn from the Edinburgh Concurrency
Workbench^7 [23].

3.10.5.2  Error  Correcting  Grammatical  Inference.  Error  Correcting 
Grammatical  Inference  (ECGI)  proposed  by  Rulot  and  Vidal  is  a  GI  heuristic  that 
incrementally  infers  the  target  automata  model  [226].  ECGI  combines  statistical 
extrinsic  methods  and  heuristic  methods. 

The  approach,  which  does  not  require  negative  sample  support,  is  based  on  error 
correcting  parsing.  The  ECGI  algorithm  builds  a  hypothesis  automaton  by  initially 
creating  a  trivial  automaton  from  the  first  presented  sample  word  [226].  States  and 
transitions  are  added  to  the  hypothesis  automaton  for  every  new  unrecognized  sample 
[226]. Error correcting parsing is used to determine what states and transitions to add
by searching for the best path for the input sample in the hypothesis automaton [226].
A statistical extrinsic method, such as Hamming or Levenshtein distance, can be used
to measure which path is best. Heuristic restrictions eliminate loops and circuits in
the inferred hypothesis automaton.

6 A language L is prefix-closed if, for every w ∈ L, every prefix of w is also in L
[136, Definition 3.2].

7 Edinburgh Concurrency Workbench http://homepages.inf.ed.ac.uk/perdita/cwb/

We did not discover a formal analysis of the performance characteristics of
ECGI. We did discover experimental empirical results specific to speech recognition
in [232].

The ECGI algorithm has also been applied in language modeling [209] and
parts-of-speech tagging [206,207]. Sanchis also discusses the use of ECGI for inferring
phonetic units [232]. Rulot also proposes an extension to the ECGI algorithm [226]
to support stochastic target automata. The stochastic extension is expanded by [269,
Section 2.2].

3.10.5.3 Other Approaches. Graine [97] introduces a method for learning
regular languages with constant alphabet sizes using neural networks. The method
is O(n^2) time complexity for k-reversible regular languages. Giordano examines
inference of regular languages by a tabu search that requires both positive and negative
examples [92]. Belz proposes a genetic algorithm for automata inference [21].
Niparnan [183] and Lai [136] also examine genetic algorithm approaches to inference of
finite automata.

3.11  Inference  of  Higher  Order  Languages 

Grammatical inference of context-free languages (Type-2) has received some
attention. Early work was conducted by [228] and [60]. Lee [145] provides a circa
1994 survey of literature to that point. The 2004 Omphalos Context-Free Language
Learning Competition held in conjunction with the 7th International Colloquium on
Grammatical Inference [246] generated experimental results for artificially generated
data. More recently Oates [187] studied k-reversible CFG, Nakamura presented an
incremental CFG learning algorithm [179], while Petasis [197] and Javed [120] both
propose genetic algorithm approaches.

Non-terminally separated languages (NTS), a subclass of deterministic
context-free languages (where Type-3 ⊂ NTS ⊂ Type-2), have also received some attention.
Clark [44] recently examined PAC-learning of NTS languages. Another language
class that cross-cuts the Chomsky hierarchy is the class of very simple grammars
proposed by Yokomori, which contains elements of 0-reversible, left Szilard of linear,
regular (Type-3), and NTS languages [282].

We  did  not  discover  attempts  to  directly  infer  context-sensitive  (Type-1)  or 
recursively  enumerable  (Type-0)  languages. 

3.12  Chapter  Summary 

In  this  chapter  we  related  the  problem  domain  of  dynamic  protocol  reverse 
engineering  from  network  traces  to  the  algorithm  domain  of  grammatical  inference. 
We  introduced  the  Chomsky  Hierarchy  as  a  framework  for  discussing  computational 
learnability.  Next,  we  developed  the  symbolic  model  and  mathematical  notation  that 
defines  the  characteristics  of  the  algorithm  domain.  Finally,  we  discussed  several 
existing algorithmic and heuristic approaches to grammatical inference.

IV. Experimental Design

The purpose of this chapter is to establish the methodology we use to evaluate
existing algorithms for effectiveness and efficiency in dynamic protocol reverse
engineering, and thereby to gather empirical evidence for the applicability of
grammatical inference. A hybrid application of state-of-practice format recognition and
grammatical inference is presented which could support the use of formal techniques
to identify vulnerabilities in the specification, implementation, and deployed
configuration of network protocols. We outline the approaches we implement in our
experimental design and detail the results of format recovery and control flow recovery. The
main emphasis is on techniques useful for protocol control flow (δ) recovery.

4.1 Application Level Network Traces into Automata

Once again, we must address the following four issues:

• Network trace collection.

• Application level protocol data flow recovery.

• Protocol format (Σ) recovery.

• Protocol transition function (δ) recovery.

None of the methods discussed in Chapter II or Chapter III provide an
automated means of naming or uniquely identifying the operators that make up the
vocabulary of an arbitrary protocol. For this reason we propose mining the protocol’s
operator packet formats from existing open sources and algorithmically generating an
automata representation. The choice of development tools and supporting toolkits
is explained in Appendix B.

4.2 Protocol Selection

Specifications for open protocols, such as SMTP and POP3, are available in
online specification documents. Using open protocols may seem counter-intuitive,
but it allows us to establish benchmarks for comparison. For closed or proprietary
protocols, open source projects such as Bro, jNetStream, and Wireshark embody
the collective reverse engineering efforts of their contributors [19,48,140]. We select
POP3 and SMTP because they are textually represented, synchronous, and the session
boundaries are easily detected.

An  additional  reason  we  selected  POP3  is  that  with  only  10,960  TCP  connection 
attempts  on  port  110  it  is  relatively  low  volume  in  relation  to  SMTP  traffic  with 
126,545  TCP  connection  attempts.  This  allowed  us  to  evaluate  the  proof  of  concept 
software  on  a  lower  volume,  but  similarly  structured,  protocol  during  incremental 
development. 

4.3 Algorithm Selection

The problem domain we are considering has the following characteristics: we
do not know if a sample is positive or negative, we do not have access to an
oracle, and we do not know if the languages under consideration are characterizable by
regular languages or subclasses of regular languages. By constraining the scope of
the problem to textually represented single-channel protocols using TCP transport
on IPv4 networks (specifically SMTP and POP3) we know that the grammars for the
protocol languages are specified in English language and in a context-free format
(Augmented BNF). We select k-RI, a characterizable method, and k-TSSI, an incremental
characterizable method, for our proof of concept implementation.
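To make the flavor of a characterizable method concrete, the following is a minimal sketch of k-testable inference from positive samples only. It assumes that k-TSSI refers to inference of languages that are k-testable in the strict sense; it is a simplified textbook formulation rather than the proof of concept implementation, and the names KTSSModel, learn_ktss, and accepts are hypothetical. Samples shorter than k are simply remembered verbatim, which is a simplification.

```python
from typing import Iterable, NamedTuple, Sequence


class KTSSModel(NamedTuple):
    """Hypothetical container for a k-testable (strict sense) hypothesis."""
    k: int
    prefixes: frozenset   # observed initial segments of length k-1
    suffixes: frozenset   # observed final segments of length k-1
    segments: frozenset   # observed segments of length k
    short: frozenset      # whole samples shorter than k, kept verbatim


def learn_ktss(samples: Iterable[Sequence[str]], k: int) -> KTSSModel:
    """Collect the prefix, suffix, and segment sets from positive samples."""
    if k < 2:
        raise ValueError("this sketch assumes k >= 2")
    prefixes, suffixes, segments, short = set(), set(), set(), set()
    for sample in samples:
        s = tuple(sample)
        if len(s) < k:
            short.add(s)
            continue
        prefixes.add(s[:k - 1])
        suffixes.add(s[len(s) - (k - 1):])
        for i in range(len(s) - k + 1):
            segments.add(s[i:i + k])
    return KTSSModel(k, frozenset(prefixes), frozenset(suffixes),
                     frozenset(segments), frozenset(short))


def accepts(model: KTSSModel, sample: Sequence[str]) -> bool:
    """Accept a string whose prefix, suffix, and every k-segment were observed."""
    s = tuple(sample)
    k = model.k
    if len(s) < k:
        return s in model.short
    return (s[:k - 1] in model.prefixes
            and s[len(s) - (k - 1):] in model.suffixes
            and all(s[i:i + k] in model.segments
                    for i in range(len(s) - k + 1)))


# Illustrative, made-up sample strings (not taken from the IDEVAL data set).
train = [
    "TCPopen 220 HELO 250 QUIT 221 TCPclose".split(),
    "TCPopen 220 HELO 250 MAIL 250 QUIT 221 TCPclose".split(),
]
model = learn_ktss(train, k=2)
```

With k = 2 the hypothesis generalizes beyond the training set: a session that repeats the MAIL/250 exchange is accepted because every adjacent pair of operators was observed, while a session that jumps from 220 directly to QUIT is rejected.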


4.4 Experimental Architecture

Our experimental architecture is simplified by the use of an existing data set and
by limiting the study to POP3 and SMTP. Because we are using an existing data set, the
network trace collection is already determined. Also, both POP3 and SMTP
encapsulate a complete session within a single transport level TCP connection. This reduces
our overall experimental architecture to protocol format (Σ) recovery and transition
function (δ) recovery. Figure 4.1 shows the two components that we implemented.

Figure 4.1: Experimental Architecture Overview - protocol
format (Σ) recovery is implemented in the “Sample and Alphabet
Construction” process. The low level implementation, called
flowtool, uses hand coded POP3 and SMTP command and reply
parsers to generate an alphabet and sample strings. Transition
function (δ) recovery is implemented in the “Automata
Inference” process. The low level implementation, called flowinfer,
executes the inference algorithm on the alphabet and sample
strings to generate an automata representation of the protocol’s
control flow.

4.4.1 Network Trace Collection. Network trace collection for IPv4 protocols
is well covered by others. For our experimental architecture we are using existing
network traces, so the collection architecture was pre-determined by the data set under
investigation. The data set used is the IDEVAL data set discussed in Appendix A.
The trace collection architecture used to build the IDEVAL data set is described
in [151].

4.4.2 Application Level Protocol Data Flow Recovery. Both of the selected
protocols’ session structures are encapsulated in single TCP connections. This makes
extraction of application level session data flows equivalent to extracting TCP
connections. Using the assumption that protocol ports are accurate for the IDEVAL
data set, we de-multiplex the raw data to extract TCP traffic on port 25 (SMTP)
and port 110 (POP3). Finally, we merge the trace files for SMTP and POP3 traffic
into a cumulative data file. The pre-processing workflow is detailed in Appendix A,
Section A.4.2.
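The port-based de-multiplexing step can be illustrated with a small sketch. It assumes the trace has already been decoded into per-segment records; the Segment type and the demultiplex function are hypothetical names introduced here, and the sketch does not perform TCP reassembly or distinguish successive connections that reuse the same addresses and ports.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Segment(NamedTuple):
    """Hypothetical decoded TCP segment record."""
    src: str
    sport: int
    dst: str
    dport: int
    payload: bytes


# Well-known server ports for the two selected protocols.
SERVICE_PORTS = {25: "SMTP", 110: "POP3"}

FlowKey = Tuple[str, str, int, str, int]  # (protocol, client, cport, server, sport)


def demultiplex(segments: Iterable[Segment]) -> Dict[FlowKey, List[Segment]]:
    """Group segments into per-connection flows, keeping only SMTP and POP3."""
    flows: Dict[FlowKey, List[Segment]] = defaultdict(list)
    for seg in segments:
        if seg.dport in SERVICE_PORTS:      # client-to-server direction
            key = (SERVICE_PORTS[seg.dport], seg.src, seg.sport, seg.dst, seg.dport)
        elif seg.sport in SERVICE_PORTS:    # server-to-client direction
            key = (SERVICE_PORTS[seg.sport], seg.dst, seg.dport, seg.src, seg.sport)
        else:
            continue                        # traffic on other ports is dropped
        flows[key].append(seg)
    return dict(flows)


# Made-up records for illustration only.
example = [
    Segment("10.0.0.1", 1025, "10.0.0.2", 25, b"HELO a\r\n"),
    Segment("10.0.0.2", 25, "10.0.0.1", 1025, b"250 ok\r\n"),
    Segment("10.0.0.1", 1026, "10.0.0.2", 80, b"GET /"),
    Segment("10.0.0.3", 2000, "10.0.0.2", 110, b"USER x\r\n"),
]
flows = demultiplex(example)
```

Both directions of a connection map to the same key, so each flow holds the complete command and reply exchange for one session.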


4.4.3 Protocol Format Recovery. After the traces were extracted we used
a tool we developed called flowtool to create alphabet and sample strings from the
cumulative protocol traces and weekly protocol traces.

We implemented hand coded operator parsers for our proof of concept
implementation. Since we are examining open protocols we were able to use the
specification documents and heuristics from Wireshark and Bro to implement operator
oriented parsers. We processed the network trace files with flowtool to
extract the alphabet and sample strings. A low level description of flowtool is provided
in Appendix B, Section B.5.1.
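An operator oriented parser of this kind can be sketched as follows. This is an illustrative approximation, not the flowtool implementation: the function names and verb sets are assumptions, the verb sets are limited to the commands this chapter lists for each protocol, case is preserved so that lower case commands remain distinct operators, and only lines terminated by <CR><LF> are recognized. Multi-line SMTP replies are not collapsed.

```python
import re
from typing import FrozenSet, Optional

# Command verbs named in this chapter for each protocol.
SMTP_VERBS = frozenset({"HELO", "EHLO", "MAIL", "RCPT", "DATA", "RSET", "VRFY",
                        "EXPN", "HELP", "NOOP", "QUIT", "SEND", "SOML", "SAML"})
POP3_VERBS = frozenset({"USER", "PASS", "STAT", "LIST", "RETR", "DELE", "NOOP",
                        "RSET", "QUIT", "APOP", "TOP", "UIDL"})

# Three-digit SMTP reply code followed by a space, a hyphen, or end of line.
_SMTP_REPLY = re.compile(rb"(\d{3})(?:[ -]|$)")


def command_operator(line: bytes, verbs: FrozenSet[str]) -> Optional[str]:
    """Return the command verb of a CRLF-terminated line, or None."""
    if not line.endswith(b"\r\n"):
        return None                      # not specification compliant
    fields = line[:-2].split()
    if not fields or not fields[0].isalpha():
        return None
    verb = fields[0].decode("ascii")     # isalpha() guarantees ASCII letters
    return verb if verb.upper() in verbs else None


def reply_operator(line: bytes) -> Optional[str]:
    """Return the reply code (SMTP digits or POP3 +OK/-ERR), or None."""
    if not line.endswith(b"\r\n"):
        return None
    body = line[:-2]
    if body.startswith(b"+OK"):
        return "+OK"
    if body.startswith(b"-ERR"):
        return "-ERR"
    match = _SMTP_REPLY.match(body)
    return match.group(1).decode("ascii") if match else None
```

The alphabet is then the set of distinct operators returned over all flows, and each sample string is the ordered sequence of operators observed in one flow.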


4.4.4 Protocol Transition Function Recovery. The selected inference
algorithm is executed against the alphabet and sample strings created by flowtool.
Because we are using linear string models to represent samples of behavior, we can select
from a range of existing GI algorithm implementations. We select simple DFA for
our target automata representation. Future efforts that conduct performance analysis
must further consider the choice of target automata representation so it is appropriate
for performance analysis. A low level description of flowinfer processing is provided
in Appendix B, Section B.5.2.

4.5 Limiting Factors

There are practical and theoretical limitations to what we can expect to achieve.
Limiting factors include our session detection technique, choice of offline analysis, and
completeness of the IDEVAL data set.

4.5.1 Session Detection. Because we are attempting to recover the control
flow from only samples of protocol behavior, we cannot exactly replicate the state
transitions caused by the application stack of the distributed system. For example,
the state transition of the SMTP sender from INITIAL to CONN_ESTABLISHED
can be inferred from the TCP connection attempt but it is not part of the application
level protocol (see Figure [T2}). In fact, the initial state must be inferred from the state
of the underlying TCP connection and the final state determined by understanding
both the operators used internally and the state of the TCP connection.

To overcome this limitation flowtool adds TCP state operators to the alphabet
and sample strings. The operators added are: TCPopen, TCPclose, TCPreset,
TCPtimeout, and NIDSexit. TCPopen denotes the initiation of a TCP connection
and TCPclose the normal termination. The TCPreset operator denotes termination of
the TCP connection by a TCP RST, while TCPtimeout denotes that the TCP connection
has timed out. Finally, the NIDSexit operator denotes that libnids has encountered
the end of the trace file before the TCP connection terminated. This indicates that
data capture was terminated before the data flow was completely recorded in the
trace file. In other words, the trace file is incomplete.
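As a small illustration of the TCP state operators, the sketch below wraps a sequence of application level operators with TCPopen and one of the four terminating operators. The Termination enumeration and build_sample function are hypothetical names; how flowtool represents samples internally is not stated here.

```python
from enum import Enum
from typing import Sequence


class Termination(Enum):
    """How the underlying TCP connection ended, as a sample string operator."""
    CLOSE = "TCPclose"        # normal termination
    RESET = "TCPreset"        # terminated by a TCP RST
    TIMEOUT = "TCPtimeout"    # connection timed out
    TRACE_END = "NIDSexit"    # trace file ended before the connection did


def build_sample(app_operators: Sequence[str], end: Termination) -> str:
    """Frame application level operators with TCP state operators."""
    return " ".join(["TCPopen", *app_operators, end.value])


sample = build_sample(["220", "HELO", "250", "QUIT", "221"], Termination.CLOSE)
```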


4.5.2 Online vs. Offline Analysis. Online analysis for traffic and protocol
characterization was conducted by [195,200,223]. Because we are using existing data
sets we will concentrate on a posteriori offline analysis instead of online analysis of
live execution traces. A summary of the capture file characteristics is provided in
Appendix A. The weekly file characteristics are in Section A.4.2. The characteristics
of the individual daily network capture are described in Section A.4.1.


4.5.3 Target Automata Representation. There are several representations
that we can consider for target automata representation. Two that already
provide a basis for analytical backends are CFSM and MSC, as previously discussed
in Section 2.8.2.2 and Section 2.8.2.1. Both the CFSM and MSC representations
have characteristics which must be considered before choosing one over the other.
CFSM have verification techniques including reachability and reverse reachability
analysis [196]. Deadlock detection techniques are also available [96]. MSC can also be
verified through the process of realizability but only for bounded sizes [5]. Verifying an
unbounded MSC using LTL model checking is in general undecidable [5]. While MSC
might be useful for simple protocol automata, the constraints on verifiability cause us
to favor CFSM representations for future efforts.

Although both CFSM and MSC models provide more powerful representation,
we select DFA for simplicity’s sake in our proof of concept implementation.


4.5.4 Incomplete Data. Training data density will impact analysis [41]. We
can expect only partially correct (approximate) inference results if the input does
not contain a representative sample of the protocol commands and replies. If the
data set does not contain a representative sample of the protocol under investigation
(i.e. the data is sparse or noisy) then the accuracy of the inference will be low.
Unfortunately, the SMTP and POP3 traffic in the IDEVAL data set does not fully
cover the allowed operators (Σ) in their respective specifications.

Table 4.1: POP3 Command Alphabet Weekly Overview - The IDEVAL data set
does not exercise all operators allowed by the POP3 specification.

Command  Week 1  Week 2  Week 3  Week 4  Week 5  Total
STAT        255     236     395     303     321  1,510
DELE        402     400     752     409     455  2,418
USER        255     236     395     303     323  1,513
UIDL          0       0       0       0       0      0
QUIT        255     237     395     303     323  1,513
TOP           0       0       0       0       0      0
RETR        402     400     752     409     455  2,418
RSET          0       0       0       0       0      0
APOP          0       0       0       0       0      0
LIST          0       0       0       0       0      0
PASS        255     236     395     303     321  1,510
NOOP          0       0       0       0       0      0

4.5.4.1 POP3 Alphabet (Σ) Overview. For POP3 we discovered no
occurrences of the following operators in the cumulative data: APOP, LIST, NOOP,
RSET, TOP, UIDL. Table 4.1 summarizes the command/operator types observed in
the IDEVAL data set. The POP3 specification, unlike SMTP, requires all command
verbs be encoded in upper case [112]. The POP3 specification only describes two
reply codes, +OK and -ERR, of which we observed 20,559 +OK and 60 -ERR replies.
We did not parse the replies for information beyond the reply code.


4.5.4.2 SMTP Alphabet (Σ) Overview. For SMTP we discovered no
occurrences of the following operators in the cumulative data: EXPN, NOOP, SEND,
SOML, SAML, VRFY. The cumulative counts for discovered SMTP operators are
shown in Table 4.2. It must be noted that although SMTP has a rigid syntax,
the specification allows for all commands and replies to be in upper case, mixed case,
or lower case [113, Section 2.4]. As shown in Table 4.2 we observed occurrences of
lower case commands used by the mailbomb attack but no occurrences of mixed case
commands. The SMTP Reply alphabet occurrences are summarized in Table 4.4.

Table 4.2: SMTP Cumulative Command Alphabet - The IDEVAL data set does
not exercise all operators allowed by the SMTP specification.

Command    Count  Note
MAIL     118,029  Initiate a mail transaction
mail       1,672  Lower case MAIL
RSET          22  Abort current mail transaction.
DATA     117,899  Treat all lines as message body until data
                  is terminated by <CR><LF>.<CR><LF>
QUIT     114,986  Server must send an OK reply and close
                  the transmission channel
EHLO     112,131  Initiate session (extended format)
HELO     112,956  Initiate session
RCPT     186,548  Identifies an individual recipient; multiple
                  recipients are specified by multiple occurrences
rcpt       1,670  Lower case RCPT
HELP           3  Server sends helpful information to the
                  client

4.5.5 Noisy Data. Our naive assumption that the port used by TCP
connection level traffic would indicate the type of encapsulated application level data proved
wrong. Because the IDEVAL data set is designed for intrusion detection system
performance evaluation, it contains intentionally generated attack traffic. We consider
the intentional attack traffic to be noise for our purposes. The specific types of noise
that impact the alphabet and sample string creation are generated by mailbomb,
tcpreset, and SYN flood attacks. To overcome noise generated by intentional misuse we
developed extrinsic filtering heuristics, discussed in Section 4.6. The reader is referred
to [286] for detailed descriptions of the attacks. Figure 4.2 shows the command and
reply flow for an attack.

Accurate recognition of the protocol in the trace is essential to the accuracy of
the k-RI and k-TSSI inference. Both algorithms are sensitive to noise if we treat all
input samples as positive data. Because we are able to recognize the control flow of
non-compliant traffic for SMTP and POP3 we can automatically label each sample
as positive or negative. This allows us to ignore the negative samples during control
flow inference.

Table 4.3: SMTP Command Alphabet Weekly Overview - The IDEVAL data set
does not exercise all operators allowed by the SMTP specification. Note that the
Total column does not sum the Week columns due to incomplete TCP connection
traces in the weekly pcap files.

Command      Week 1  Week 2  Week 3  Week 4  Week 5    Total
HELO         18,602  20,044  30,957  21,318  22,896  112,956
[illegible]  19,391  20,765  32,152  22,365  23,961  117,777
HELP              0       0       0       2       1        3
SAML              0       0       0       0       0        0
MAIL         19,424  20,858  32,167  22,391  24,050  118,029
mail              0     753       0     905       0    1,672
SOML              0       0       0       0       0        0
EHLO         18,244  19,701  30,420  21,263  22,570  112,131
QUIT         18,666  20,155  31,206  21,774  23,124  114,900
EXPN              0       0       0       0       0        0
RSET              0       6       0      15       1       22
VRFY              0       0       0       0       0        0
SEND              0       0       0       0       0        0
RCPT         27,812  31,688  50,543  36,882  40,484  186,548
rcpt              0     753       0     905       0    1,670
DATA         19,424  20,801  32,167  22,383  23,985  117,899
NOOP              0       0       0       0       0        0

Table 4.4: SMTP Reply Alphabet Summary.

Reply    Count  Note
220    126,266  Indicates beginning of session
250    565,331  Indicates last operation completed successfully
221    104,798  Service closing transmission channel
354    122,396  Start mail input; end with <CR><LF>.<CR><LF>
421        416  Service not available, closing transmission channel
451        209  Requested action aborted: local error in processing
500    107,073  Syntax error, command unrecognized
503          3  Bad sequence of commands
551        346  User not local; please try forward-path
552          2  Requested mail action aborted: exceeded storage allocation

Table 4.5: SMTP Reply Alphabet Weekly Overview.

Reply  Week 1   Week 2   Week 3   Week 4   Week 5    Total
551         0       58        0       63      225      346
503         0        0        0        2        1        3
552         0        0        0        1        1        2
421         0        0        0      215      201      416
220    19,424   22,865   32,170   25,096   27,573  126,266
221    15,931   19,828   27,345   21,607   20,087  104,798
354    19,424   22,800   32,167   24,881   23,985  122,396
451         0        0        0      106      103      209
500    17,422   18,887   29,208   20,207   21,416  107,073
250    88,118  101,824  149,971  113,555  114,504  565,331

    220 marx.eyrie.af.mil ESMTP Sendmail 8.8.0/8.8.5; Fri, 9 Apr 1999 17:27:18 -0400
    MAIL From: root@calvin.world.net
    RCPT To: kkendall@marx.eyrie.af.mil
    DATA
    Sender: diweber@calvin.mit.edu
    From: Dan Weber <djweber@sst.ll.mit.edu>
    MIME-Version: 1.0
    To: djweber@calvin.mit.edu
    Subject: Her you go.
    MIME-Version: 1.0
    Content-Type: text/plain; charset=us-ascii
    Content-Transfer-Encoding: quoted-printable

    y .
    root@calvin.world.net... Sender ok
    bin/sh@-c@cp /etc/passwd /p; printf "woot::0:0:woot:/:/bin/bash\ned::99:99::/:/bin/sh\n">> /etc/passwd;
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:
    zzzzzzzzl234STAC=dO=aa=ff=bf=
    qwertyuiop=
    QUIT]
    HELP
    QUIT
    250 kkendall@marx.eyrie.af.mil... Recipient ok
    354 Enter mail, end with "." on a line by itself
    250 RAA04144 Message accepted for delivery
    503 Need MAIL before RCPT
    503 Need MAIL command
    500 Command unrecognized
    500 Command unrecognized
    500 Command unrecognized
    500 Command unrecognized
    500 Command unrecognized
    500 Command unrecognized
    500 Command unrecognized
    500 Command unrecognized
    500 Command unrecognized
    500 Command unrecognized
    HELP
    500 Command unrecognized
    QUIT
    221 marx.eyrie.af.mil closing connection

Figure 4.2: Wireshark Following Bad SMTP Session - the
session transcript contains asynchronous commands and replies
caused by a buffer overflow injection attack.

4.5.6 Connection Level Protocol Stack. The libnids library, which we use to
handle TCP connection level packet reassembly and defragmentation, is an additional
limiting factor. The library interprets TCP connection communication using a
modified TCP/IP protocol stack from the Linux version 2.0.x kernel [275]. This means
that the application level data presented to our flowtool will be ordered in the same
manner. This problem is partially exposed by non-SMTP traffic on the SMTP port
during Weeks 4 and 5 of the IDEVAL data set.

4.6 Extrinsic Heuristics for Noise Filtering

We concentrated first on direct naive implementation of our format extraction
algorithms followed by incremental refinement. Initial pre-processing runs contained
noise caused by intentional misuse of the protocols under consideration. We used the
output of the early runs to develop the filtering mechanism that automatically labeled
noise sample strings as negative samples. The following criteria are used to label a
sample as negative (noise):

Early Termination: If the TCP connection terminates without application level
protocol session termination the sample is marked negative. This means any sample
ending with TCPreset, TCPtimeout, or NIDSexit is marked negative.

No Application Data: An empty TCP connection without application level traffic,
that is, a TCPopen followed immediately by TCPclose, is marked negative.

Asynchronous Command/Reply: If the composite sample indicates asynchronous
communication the sample is marked negative. Asynchronous communication
is detected when a command follows a command, a reply follows a reply, or the
sample contains only replies or commands.

Table 4.6: Buffer overflow attacks on SMTP server at 172.16.114.50 - The victim
server is described as an internal host named marx.eyrie.af.mil running Redhat 4.2
kernel 2.0.27 [286]. Details for the attacking hosts, 152.204.232 and 202.49.244.10,
are not provided with the IDEVAL data set.

Type  Source IP:port        Start             End               Sample
6     152.204.232.193:1941  923693227.228408  923693238.320118  TCPopen 220 250 250 250 503 HELP 500 221
                                                                TCPclose
32    202.49.244.10:1027    922715292.597701  922715294.666466  TCPopen 220 250 250 503 HELP 500 221
                                                                TCPclose
32    202.49.244.10:1027    922715290.284924  922715292.352719  TCPopen 220 250 250 503 HELP 500 221
                                                                TCPclose

Early termination was observed in 6,754 of 126,545 samples for SMTP traffic
and 8,192 of 10,960 samples in POP3 traffic. As shown in Table A.5, 4,432 of the 6,754
SMTP early terminations were observed during week 5 of the IDEVAL data set. A
significant portion, 8,090, of the 8,192 POP3 early terminations are caused by TCP
resets during week 5 (see Table A.10 in Appendix A).

An asynchronous sample is generated by the mailbomb attack (sample Type-41
in Table A.9): TCPopen 220 mail 250 250 354 250 221 TCPclose. The mailbomb
SMTP communications are recognized as asynchronous because the attack terminates
commands with <CR> instead of a standards compliant <CR><LF>. The SMTP servers are
able to parse the non-compliant communication and pass back compliant replies that
are detected by flowtool.


Another example of asynchronous behavior is generated by a buffer overflow
attack, shown in Figure 4.2. Wireshark shows more detail than our flowtool because it
interprets <CR> terminated commands. The protocol parsers in flowtool only recognize
specification compliant <CR><LF> terminated commands. The attack generates one
sample of Type-6 and two samples of Type-32 during Week 4. The three samples are
summarized in Table 4.6.


While the SMTP specification does not require synchronous operation it does
require synchronous communication. Reply codes indicate that processing is underway
and every command must generate exactly one reply [113, Section 4.2]. Asynchronous
operation is specifically permitted during session termination where the server can
send a 421 reply asynchronously after receiving a QUIT [113, Section 3.9].

Like SMTP, the POP3 specification requires synchronous communication. The
POP3 specification requires an -ERR response to any unrecognized or invalid
command and allows the server to automatically terminate a session after 10 minutes of
inactivity [112].
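The three labeling criteria of this section (early termination, no application data, asynchronous command/reply) can be sketched as a single predicate over a sample string. This is an illustrative approximation rather than the flowtool filter. In particular, it assumes that the reply which follows a 354 (sent after the message body, which is not itself a command) is not counted as a reply following a reply; how flowtool treats that case is not stated here. The function names are hypothetical.

```python
import re

TCP_OPERATORS = {"TCPopen", "TCPclose", "TCPreset", "TCPtimeout", "NIDSexit"}
EARLY_TERMINATION = {"TCPreset", "TCPtimeout", "NIDSexit"}

# SMTP three-digit reply codes and the two POP3 status indicators.
_REPLY = re.compile(r"^(?:\d{3}|\+OK|-ERR)$")


def kind(token: str) -> str:
    """Classify a sample token as 'tcp', 'reply', or 'command'."""
    if token in TCP_OPERATORS:
        return "tcp"
    return "reply" if _REPLY.match(token) else "command"


def is_negative(sample: str) -> bool:
    """Return True if the sample string should be labeled negative (noise)."""
    tokens = sample.split()
    # Early termination: connection ended without a normal TCP close.
    if not tokens or tokens[-1] in EARLY_TERMINATION:
        return True
    app = [t for t in tokens if kind(t) != "tcp"]
    # No application data: TCPopen followed immediately by TCPclose.
    if not app:
        return True
    # Only replies or only commands.
    if len({kind(t) for t in app}) == 1:
        return True
    # Command follows command, or reply follows reply.
    for prev, cur in zip(app, app[1:]):
        if kind(prev) == kind(cur):
            # Assumption: the reply after a 354 answers the message body.
            if prev == "354" and kind(cur) == "reply":
                continue
            return True
    return False
```

Applied to the samples quoted above, the mailbomb sample and the buffer overflow samples of Table 4.6 are labeled negative because a reply directly follows a reply, while a session in which every command receives exactly one reply is kept as a positive sample.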

4.7 Inference Accuracy

The hypothesis automaton might over-restrict or over-generalize the actual target
automaton of the system under consideration. If the hypothesis automaton is
over-restrictive it will not contain states and transitions that are necessary to accurately
represent the target automaton. If the hypothesis automaton over-generalizes it will
contain states and transitions that are not necessary to minimally represent the target
automaton. To accurately quantify the over-restrictiveness or over-generalization of
the inference algorithm we must know a priori the actual automaton of the protocol
under consideration. This is without regard to the performance characteristics of the
target automaton and language class membership.
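When a reference automaton is available, one simple way to make over-restriction and over-generalization measurable is to compare the strings each automaton accepts up to a bounded length. The sketch below is one such approach under stated assumptions, not the evaluation method of this work: the DFA type, the function names, and the bound are hypothetical, and a bounded comparison can miss differences that only appear in longer strings.

```python
from typing import Dict, FrozenSet, NamedTuple, Set, Tuple

Word = Tuple[str, ...]


class DFA(NamedTuple):
    """Hypothetical DFA: start state, accepting states, partial transition map."""
    start: str
    accepting: FrozenSet[str]
    delta: Dict[Tuple[str, str], str]   # (state, operator) -> next state


def language_up_to(dfa: DFA, max_len: int) -> Set[Word]:
    """Enumerate every accepted string of length at most max_len."""
    accepted: Set[Word] = set()
    frontier = [(dfa.start, ())]
    for _ in range(max_len + 1):
        next_frontier = []
        for state, word in frontier:
            if state in dfa.accepting:
                accepted.add(word)
            for (src, symbol), dst in dfa.delta.items():
                if src == state:
                    next_frontier.append((dst, word + (symbol,)))
        frontier = next_frontier
    return accepted


def compare(hypothesis: DFA, target: DFA, max_len: int):
    """Return (missing, extra): target-only and hypothesis-only strings."""
    h = language_up_to(hypothesis, max_len)
    t = language_up_to(target, max_len)
    return t - h, h - t


# Toy automata over made-up operators a, b, c.
target = DFA("q0", frozenset({"q2"}),
             {("q0", "a"): "q1", ("q1", "c"): "q1", ("q1", "b"): "q2"})
hypothesis = DFA("q0", frozenset({"q2"}),
                 {("q0", "a"): "q1", ("q1", "b"): "q2", ("q2", "b"): "q2"})
missing, extra = compare(hypothesis, target, max_len=3)
```

Strings in missing indicate over-restriction (behavior of the target that the hypothesis rejects); strings in extra indicate over-generalization (behavior the hypothesis accepts that the target does not).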

One option is to synthesize an approximate target automaton directly from the
protocol specification. Unfortunately, the specifications for the specific protocols we
are considering, SMTP and POP3, are provided in Request For Comment (RFC)
documents as English language descriptions of the control flow and Augmented BNF
descriptions of the operator formats. This is problematic because the English language
descriptions of control flow are open to interpretation. Additionally, in general it is
undecidable whether a context-free grammar generates a regular language [180, Section 4].
While a regular language (Type-3) is representable as a context-free language (Type-2),
the converse does not hold. Furthermore, transforming a context-free grammar that
generates a regular language into a FSA accepting the same language is, in general,
unsolvable [180, Section 4].

Figure 4.3: SMTP Session Initiation - SMTP session initiation
starts with the TCP connection from the client to the server.
The server replies with 220, then the client attempts an EHLO
and, if it fails, a HELO.


Because the protocols we are considering are described in English language
(control portion) and context-free Augmented BNF (data portion), we manually
generate a specification “compliant” DFA representation of the subset of the specification
commands exercised by the IDEVAL data set for comparison purposes. The choice of
compliance instead of conformance is intentional. The widely accepted Internet
protocols described in RFC documents, unlike ISO OSI protocols, do not have a standards
body that provides test suites or other conformance measurement methodologies. We
limit our manually generated automata to the happy path of each protocol. That
is, we do not include all possible error conditions from each state, only the results of
successful commands.


While the complete session for both protocols is encapsulated in a single TCP
connection, they do provide for session initiation, transaction, and session termination
stages. The separate stages for SMTP are shown in Figure 4.3, Figure 4.4, and
Figure 4.5. Our target automaton for SMTP has 19 states, 21 edges, 1 initial state, and
1 final state. The separate stages for POP3 are shown in Figure 4.6, Figure 4.7, and
Figure 4.8. Our target automaton for POP3 has 15 states, 15 edges, 1 initial state,
and 1 final state.


4.7.1 Inferred POP3 Control Flow. Figures for the automata generated
for POP3 by k-RI and k-TSSI inference for k values of 1, 2, and 3 are shown in
Appendix C.


[Figure 4.4 diagram: SMTP Transaction (edge labels include MAIL, RCPT, DATA, 250, and 354)]

Figure  4.4:  SMTP  Session  Transaction  -  A  transaction  stage 
is  started  when  the  client  sends  a  MAIL  command  followed  by 
one  or  more  RCPT  and  then  a  DATA  command  terminated  with 
a  period  on  a  line  by  itself. 


Figure  4.5:  SMTP  Session  Termination  -  Session  termination 
is  initiated  when  the  client  sends  a  QUIT  command  to  which  the 
server  replies  with  a  250  or  221  then  finally  the  TCP  connection 
should  be  closed. 


[Figure 4.6 diagram: POP3 Initiation (edges TCPopen, +OK, USER, +OK, PASS, +OK)]


Figure 4.6: POP3 Session Initiation - POP3 session initiation
starts with the TCP connection from the client to the server.
The server replies with +OK, then the client attempts
authentication with USER then PASS commands.


[Figure 4.7 diagram: POP3 Transaction]


Figure 4.7: POP3 Session Transaction - After authentication
the transaction stage starts, which allows LIST, RETR followed
by DELE, and STAT commands.


[Figure 4.8 diagram: POP3 Termination (edges QUIT, +OK, TCPclose)]


Figure 4.8: POP3 Session Termination - The session ends
when the client sends a QUIT command and the server replies
with +OK and finally the TCP connection is closed.


Table 4.7: k-RI POP3 Composite Automaton Filtered.

k        States   Edges   Initial   Terminal
Target   15       15      1         1
PTA      163      162     1         18
1        15       15      1         1
2        16       17      1         1
3        18       19      1         1
4        20       21      1         1
5        22       22      1         2
6        23       24      1         2
7        25       26      1         2
8        27       28      1         2
9        29       29      1         3
10       30       31      1         3
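
The PTA row of Table 4.7 reports one fewer edge than states (163 states, 162 edges), which is the shape a prefix tree acceptor has by construction: every state except the root is reached by exactly one edge. A minimal sketch of that construction follows, assuming each sample is a sequence of operator tokens. The function name build_pta and the two sample sessions are hypothetical and are not taken from the IDEVAL data.

```python
def build_pta(samples):
    """Build a prefix tree acceptor from positive samples.

    Returns (transitions, finals, n_states), where state 0 is the root,
    transitions maps (state, symbol) -> state, and finals holds the states
    at which some sample ends.
    """
    transitions = {}
    finals = set()
    n_states = 1  # the root always exists
    for sample in samples:
        state = 0
        for sym in sample:
            key = (state, sym)
            if key not in transitions:
                transitions[key] = n_states  # allocate a fresh state
                n_states += 1
            state = transitions[key]
        finals.add(state)
    return transitions, finals, n_states


# Two hypothetical POP3 sessions sharing a common prefix.
samples = [
    ["USER", "+OK", "PASS", "+OK", "QUIT", "+OK"],
    ["USER", "+OK", "PASS", "+OK", "STAT", "+OK", "QUIT", "+OK"],
]
transitions, finals, n_states = build_pta(samples)
```

State-merging inference algorithms start from a tree of this kind and merge states, which is why the inferred automata in the table are much smaller than the PTA.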

The k-RI inference with k = 1 produced an automaton that exactly matches
the target automaton for the subset of the POP3 protocol exercised in the IDEVAL data
set. The k-RI inference over-generalized the target automaton at k
values of 2, 3, 4, and 5. The k-TSSI inference was over-restrictive at k = 1,
and was equivalent to the k-RI inference for k values of 2, 3, 4, and 5. The number of
states and edges for POP3 k-RI inference is shown in Table 4.7 and for k-TSSI inference
in Table 4.8.
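
To make the role of k concrete, the following is a minimal sketch of inference for k-testable languages in the strict sense, written from the textbook definition (allowed prefixes and suffixes of length k-1, allowed segments of length k, and the short samples). It is not the implementation evaluated here, it represents the hypothesis as sets rather than as an automaton, and the operator sequences are hypothetical.

```python
def ktss_learn(samples, k):
    """Collect the sets that define a k-testable language (k >= 1)."""
    prefixes, suffixes, factors, short = set(), set(), set(), set()
    for w in samples:
        w = tuple(w)
        if len(w) < k:
            short.add(w)  # samples too short to contain a length-k segment
        if len(w) >= k - 1:
            prefixes.add(w[:k - 1])
            suffixes.add(w[len(w) - (k - 1):])
        for i in range(len(w) - k + 1):
            factors.add(w[i:i + k])  # every observed window of length k
    return prefixes, suffixes, factors, short


def ktss_accepts(model, w, k):
    """Accept w if it is a short sample, or if its prefix, suffix, and
    every length-k window were observed in the samples."""
    prefixes, suffixes, factors, short = model
    w = tuple(w)
    if w in short:
        return True
    if len(w) < k - 1:
        return False
    return (w[:k - 1] in prefixes
            and w[len(w) - (k - 1):] in suffixes
            and all(w[i:i + k] in factors for i in range(len(w) - k + 1)))


# Hypothetical operator sequences (replies omitted for brevity).
samples = [
    ("USER", "PASS", "STAT", "LIST", "QUIT"),
    ("USER", "PASS", "LIST", "STAT", "QUIT"),
]
unseen = ("USER", "PASS", "STAT", "QUIT")
m2 = ktss_learn(samples, 2)
m3 = ktss_learn(samples, 3)
```

With these samples the unseen sequence is accepted at k = 2, because every adjacent pair was observed, and rejected at k = 3, because the window PASS, STAT, QUIT never occurs. A larger k therefore admits fewer unseen strings.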


Table 4.8: k-TSSI POP3 Composite Automaton Filtered.

k        States   Edges   Initial   Terminal
Target   15       15      1         1
1        10       15      1         1
2        16       17      1         1
3        18       19      1         1
4        20       21      1         1
5        22       22      1         2
6        23       24      1         2
7        25       26      1         2
8        27       28      1         2
9        29       29      1         3
10       30       31      1         3


Table 4.9: k-RI POP3 Composite Automaton Unfiltered.

k        States   Edges   Initial   Terminal
Target   15       15      1         1
PTA      205      204     1         30
1        47       55      1         4
2        56       59      1         11
3        60       61      1         13
4        62       63      1         13
5        64       64      1         14
6        65       66      1         14
7        67       68      1         14
8        69       70      1         14
9        71       71      1         15
10       72       73      1         15


Table 4.10: k-TSSI POP3 Composite Automaton Unfiltered.

k        States   Edges   Initial   Terminal
Target   15       15      1         1
1        14       31      1         4
2        32       38      1         11
3        39       41      1         13
4        42       44      1         13
5        45       46      1         14
6        47       49      1         14
7        50       52      1         14
8        53       55      1         14
9        56       57      1         15
10       58       60      1         15


4.7.2 Inferred SMTP Control Flow. Ambiguities in the specification are
exhibited in the inferred control flow. The SMTP RFC allows two different replies
to a QUIT command. In [113, Section 4.1.1.10] the specification states: “This
command specifies that the receiver MUST send an OK reply, and then close the
transmission channel”. In [113, Section 4.2.2] 221 is defined as: 221 <domain> Service
closing transmission channel. In a three-digit reply, a first digit of 2 indicates
positive completion; a second digit of 2 refers to the transmission channel, while a
second digit of 5 indicates the status of the receiver mail system; the third digit gives
a finer gradation of meaning [113, Section 4.2.1]. The lack of clarity in the specification
is reflected in the data set. The server named hume with IP number 172.16.112.100
replies to QUIT commands with 250 while others reply with 221.
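
The digit structure of the reply codes can be illustrated with a short sketch. The category names below follow the reply-code theory in the SMTP specification as we read it; the function classify_reply and the dictionary names are assumptions introduced for illustration.

```python
# First digit: overall outcome of the command.
FIRST_DIGIT = {
    "1": "positive preliminary",
    "2": "positive completion",
    "3": "positive intermediate",
    "4": "transient negative completion",
    "5": "permanent negative completion",
}

# Second digit: category of the response; other values are unspecified.
SECOND_DIGIT = {
    "0": "syntax",
    "1": "information",
    "2": "connections",
    "5": "mail system",
}


def classify_reply(code: str):
    """Split a three-digit reply code into (outcome, category, detail digit)."""
    if len(code) != 3 or not code.isdigit():
        raise ValueError("reply code must be three digits: %r" % code)
    outcome = FIRST_DIGIT.get(code[0], "unknown")
    category = SECOND_DIGIT.get(code[1], "unspecified")
    return outcome, category, code[2]


quit_reply_221 = classify_reply("221")
quit_reply_250 = classify_reply("250")
```

Both replies observed after QUIT are positive completions; they differ only in category, with 221 referring to the connection and 250 to the mail system. An inference procedure that treats the full code as the symbol will therefore produce two distinct edges.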


The number of states and edges for SMTP k-RI inference is shown in Table 4.11
and for k-TSSI inference in Table 4.12.


4.8 Sensitivity to Noise

The  selected  inference  algorithms  are  directly  sensitive  to  noise.  Any  sample 
string  created  by  our  format  recognition  that  is  not  a  member  of  the  protocol  under 
consideration  will  cause  over-generalization  of  the  target  automaton. 

One SMTP composite sample type is of particular interest because of its impact
on the result of k-RI inference. Type-41, with 3 occurrences, contains a series of valid
command/reply pairs that are reset by the RSET command (see Appendix A, Table A.9).
The k-RI algorithm is unable to reduce the sequence, resulting in over-generalization.
If we remove the 3 samples, marking them as negative samples, k-RI inference results
improve (see Table 4.15 and Table 4.11). On the other hand, k-TSSI inference is not
impacted by Type-41 samples.


Table 4.11: k-RI SMTP Composite Automaton Filtered.

k        States   Edges   Initial   Terminal
Target   19       21      1         1
PTA      454      453     1         36
1        60       68      1         1
2        69       76      1         2
3        77       84      1         3
4        85       92      1         4
5        93       101     1         5
6        102      111     1         6
7        112      123     1         6
8        124      136     1         7
9        137      150     1         8
10       151      165     1         9


Table 4.12: k-TSSI SMTP Composite Automaton Filtered.

k        States   Edges   Initial   Terminal
Target   19       21      1         1
1        11       18      1         1
2        19       26      1         2
3        27       36      1         4
4        37       49      1         5
5        50       65      1         7
6        66       82      1         9
7        83       97      1         13
8        98       111     1         16
9        112      124     1         19
10       125      134     1         22


Table 4.13: k-RI SMTP Composite Automaton Unfiltered.

k        States   Edges   Initial   Terminal
Target   19       21      1         1
PTA      587      586     1         99
1        85       131     1         4
2        133      155     1         30
3        157      172     1         40
4        175      187     1         45
5        188      198     1         49
6        199      210     1         51
7        211      226     1         51
8        227      244     1         55
9        245      261     1         58
10       262      286     1         60


Table 4.14: k-TSSI SMTP Composite Automaton Unfiltered.

k        States   Edges   Initial   Terminal
Target   19       21      1         1
1        28       90      1         4
2        91       115     1         30
3        116      132     1         40
4        133      146     1         45
5        147      158     1         49
6        159      170     1         51
7        171      186     1         51
8        187      204     1         55
9        205      221     1         58
10       222      247     1         60


Table 4.15: k-RI SMTP Composite Automaton Type-41 removed - k-RI inference
results improve when 3 occurrences of sample Type-41 are removed.

k        States   Edges   Initial   Terminal
Target   19       21      1         1
1        19       26      1         1
2        27       33      1         2
3        34       40      1         3
4        41       47      1         4
5        48       55      1         5
6        56       64      1         6
7        65       75      1         6
8        76       87      1         7
9        88       100     1         8
10       101      114     1         9


4.9 Algorithm Runtimes

We executed the k-RI and k-TSSI algorithms against the composite samples
to develop an approximate understanding of the runtime. The runtimes presented
are specific to the IDEVAL data set and the execution environment used for analysis.
They should NOT be interpreted as a general performance indicator. Both algorithms
were executed 250 times against the composite samples for k values 1 to 5. The
executables were compiled with GCC version 3.3.6 with optimizations enabled (-O3).
The environment used was openSUSE version 10.3 running on an Intel Core 2 Duo
T2500 based computer operating at 2.0 GHz. The runtimes reported are the average
of 250 executions.

Table 4.16: POP3 Composite Runtimes - runtimes are in seconds.

k    k-RI      k-TSSI
1    0.43084   0.0258
2    0.4074    0.01652
3    0.38852   0.01792
4    0.1558    0.01828
5    0.49132   0.0196

Table 4.17: SMTP Composite Runtimes - runtimes are in seconds.

k    k-RI      k-TSSI
1    2.73996   0.02196
2    3.02336   0.019
3    4.20492   0.0204
4    4.52704   0.02156
5    4.7618    0.0226
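
The averaging procedure behind these runtime tables (a fixed number of repeated executions, with the mean reported) can be sketched as follows. This is an illustration only: the measurements in this chapter were taken on compiled executables, whereas the sketch times an arbitrary Python callable, and the name average_runtime is an assumption.

```python
import time
from statistics import mean


def average_runtime(fn, runs=250):
    """Call fn() `runs` times and return the mean wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return mean(durations)


# Stand-in workload; a real measurement would invoke the inference routine.
calls = []
avg_seconds = average_runtime(lambda: calls.append(sum(range(1000))), runs=5)
```

Reporting only the mean hides run-to-run variation; recording the minimum or the standard deviation alongside it would show whether an outlier such as the k = 4 POP3 value is a measurement artifact.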

4.10 Chapter Summary

We  presented  a  hybrid  application  of  state-of-practice  format  recognition  and 
grammatical  inference  of  protocol  control  flow  which  could  support  the  use  of  for¬ 
mal  techniques  to  identify  vulnerabilities  in  the  specification,  implementation,  and 
deployed  configuration  of  network  protocols.  We  outlined  our  experimental  design 
and  detailed  the  results  of  format  recovery  and  control  flow  recovery  for  POP3  and 
SMTP  protocol  traffic  from  the  IDEVAL  1999  data  set. 


V.  Analysis  and  Results 

Dynamic  protocol  reverse  engineering  is  a  challenging  problem  that  is  unlikely  to  yield 
significant  progress  without  research  across  a  broad  multi-disciplinary  range  of  topics. 
Here  we  present  experimental  results  of  our  proof  of  concept  implementation  and  our 
conclusions.  Finally  we  propose  areas  for  future  research. 

5.1 Conclusions

While we have demonstrated the applicability of two grammatical inference
algorithms for two specific protocols, we have not established generality of the approach.

5.1.1 Experimental Results. The k-RI algorithm provided accurate inference
of POP3 control flow with k = 1 on filtered data. The k-RI algorithm approximated
our target automaton for SMTP with k = 1 when we removed samples that contained
operators not included in our target automaton. k-TSSI over-restricted POP3 traffic
with k = 1 and overgeneralized for other values of k. k-TSSI also over-restricted
SMTP traffic for k = 1 and overgeneralized for other values of k.

5.1.2  Investigative  Questions  Answered.  The  focus  of  this  research  was  the 
evaluation  of  existing  Grammatical  Inference  algorithms  for  the  dynamic  protocol 
reverse  engineering  domain.  We  examined  the  following  questions  with  the  following 
results: 

[IQ1] What information is necessary to reverse engineer the control portion of
application layer protocols from data flows?

A network trace collection architecture must be constructed that is able to
accurately record traces of the protocol traffic without losing samples. Next, we
must have a means to reconstruct the application session. Finally, we must have
access to the format of protocol operators or be able to derive the operator format.

[IQ2] Given the proven [7,95] difficulty of inferring finite automata from positive
samples only, are there GI approaches that are appropriate for reverse engineering
automata representations of the control portion of application layer protocols from
data flows?

For the specific protocols we examined the answer is yes, within limits of data
completeness. Both k-RI and k-TSSI inference were able to infer control flow that
fit the target automata for POP3 and closely approximated the target automata for
SMTP data observed in the IDEVAL data set. Unfortunately, the IDEVAL data
set does not completely exercise either protocol, resulting in incomplete automata.
Finally, we have not established the generality of the approach and must leave this
as an open question.

5.2 Future Work

This  thesis  presented  a  grammatical  inference  approach  to  reverse  engineering 
models  of  protocol  control  flow  from  network  traces.  This  is  an  initial  step  in  gener¬ 
ating  tactical  cyber  weapons  that  target  computer  network  systems.  Future  research 
could  evaluate  the  following  areas: 

1.  Model  recovery  of  other  classes  of  protocols  such  as:  asynchronous,  binary  rep¬ 
resented,  multi-connection  and  multi-channel  protocols  from  network  traces. 

2.  Model  recovery  of  higher  order  automata  such  as  context-free  grammars,  context- 
sensitive  grammars. 

3.  Model  recovery  from  other  families  of  protocols.  While  we  concentrated  on 
a  subset  of  application  level  protocols  on  IPv4  networks  similar  experimental 
analysis  could  be  conducted  against  other  classes  of  protocols,  such  as  SCADA 
or  SS7,  for  vulnerability  assessment  and  generation  of  targeted  effects. 

4.  Online, live, and incremental model recovery. The experimental structure
evaluated in this thesis requires the full (non-incremental) construction of the sample
space. The k-RI and k-TSSI algorithms evaluated could support incremental
modifications.

5.  Formal analysis of automatically generated models.¹ Ammons presents formal
analysis of specifications automatically generated from observations of
instrumented applications [6]. Dallmeier examines the discovery of normal program
behavior [59]. Bishop [30] studied automated specification discovery at the
packet level of granularity. It could be beneficial to examine automatically
created protocol specifications for implementation issues that allow deliberately
crafted packets to lead a protocol parser into unexpected conditions.

6.  Consolidation of the well known grammatical inference algorithms into an open
source analysis framework like Weka [274] or Rapidminer [171], or a modeling
framework like Ptolemy [144], might benefit the machine learning community.
The Mical [213] and Learnlib [211] projects present frameworks which implement
several GI algorithms. Algorithms could be gleaned from other research efforts
(e.g. [46,169]).

7.  Examine other automated or semi-automated approaches to discovering protocol
defects such as RCE or randomized boundary testing.²

8.  Examine cryptographic protocol verification methodologies for formalisms that
can be adapted to protocol reverse engineering in general. Dongxi [66], for
example, proposes an automatic attack construction algorithm to discover potential
attacks on cryptographic security protocols.

9.  Further  examine  meta-heuristic  techniques  such  as  tabu  search  or  randomized 
techniques  like  genetic  algorithms  for  their  applicability  to  inference  of  protocol 
control  flow. 

10.  Construction of a publicly releasable research data set containing
contemporary network traffic. Limitations of the IDEVAL data set used in this research
are discussed in Appendix A. It would be beneficial to the network research
community as a whole to develop a data set which addresses the concerns presented
by [166].

¹ [67, 273] provide overviews of formal analysis.

² Commonly referred to as fuzzing [249, Chapter 14] and [175,188].

11.  Formalize  results  of  inference  mechanisms.  While  we  presented  limited  empirical 
evidence  that  grammatical  inference  algorithms  could  be  applied  to  the  problem 
domain  under  consideration  we  did  not  provide  formal  proof  of  the  performance 
characteristics. 

12.  Examine  other  language  models  besides  linear  systems  built  from  strings.  [224] 
presents  extensions  of  formal  languages  for  multi-dimensional  objects  such  as 
trees  and  graphs  or  Clark’s  planar  languages  [45,46]. 

5.3 Summary

Ultimately, the grammatical inference approach presented only provides
information that can assist an informed human analyst in protocol reverse engineering. The
analyst will still have to apply common heuristics (e.g. identifying signpost values,
block structure inference, or windowed entropy). We have provided limited
empirical evidence that our grammatical inference approach to dynamic protocol reverse
engineering is applicable to the protocol reverse engineering problem domain. This
approach to control flow recovery should be further developed to support automated
analysis of inferred control flow for performance characteristics.

Appendix A. Data

This appendix describes the data set used in this thesis and provides samples of the
application level protocol traces under consideration.

A.1 Natural Data Sets

The Internet Traffic Archive (ITA) is a moderated repository, sponsored by ACM
SIGCOMM, to support widespread access to traces of Internet network traffic [139]. A
data set previously offered by the Passive Measurement and Analysis (PMA)
project of NLANR, and now by CAIDA, provides header traces from OC3 through
OC48 speeds [167]. Unfortunately, the ITA and NLANR/PMA data sets were too
narrowly focused for our research efforts and did not contain application level protocol
traces.

A.2 Artificial Data Sets

While the ITA and NLANR/PMA data sets draw from real world traffic, we
also considered the use of artificial data sets. Various authors have proposed or
constructed data sets of network traces appropriate for their area of studies [58,215].
The LARIAT system [222] was considered for generation of application level protocol
traces. Regrettably, we were not able to gain access to a working LARIAT system
early enough to generate appropriate data sets.

A.3 DARPA Intrusion Detection Evaluation data set

The DARPA Intrusion Detection Evaluation data set is available from the
Massachusetts Institute of Technology (MIT) Lincoln Laboratory. The data set was
produced by the Information Systems Technology (IST) Group of MIT Lincoln
Laboratory under Defense Advanced Research Projects Agency (DARPA) and Air Force
Research Laboratory (AFRL) sponsorship [286]. The data set provides examples of
attacks and background traffic. More importantly for this research, it provides
simulation of user generated traffic of ASCII text represented single-channel POP3 and
SMTP protocol traffic. POP3 traffic involved internal users accessing external mail
servers [151]. SMTP traffic was comprised of individual, group and global email
messages to and from all simulated users [151]. Like Mahoney, we will refer to the data
set as IDEVAL [155]. IDEVAL is both publicly available and widely used in research
efforts.


A.3.1 IDEVAL Data Quality. While the IDEVAL is widely used for
evaluation of intrusion detection algorithms and systems, there has been some concern
expressed about how accurately the data set represents more contemporary TCP/IP
network activity [155]. In his assessment McHugh even questions the collection
methodology, attack taxonomy and low traffic rates (among other characteristics) [166].
Mahoney and Chan analyzed the data set for simulation artifacts, concluding that the
data set lacked real-world ranges in the packet parameters (i.e. TTL, TCP flags, TCP
window size) [155]. Furthermore, the data set lacks real-world traffic crud caused by
incorrect implementations of the TCP/IP protocols [119,155].


A.3.2 IDEVAL Data Relevance. The IDEVAL network configuration does
not reflect contemporary hardware, software, or operating systems. Operating
systems used include MacOS, Redhat 5.0 kernel 2.0.32, Redhat 5.2 kernel 2.0.36, Solaris
2.5.1, Solaris 2.6, SunOS 4.1.4, Windows 3.1, Windows 95, and Windows NT 4.0 build
1381 Service Pack 1 [286]. None of these operating systems are currently authorized
for use in DOD networks. Dialects of network protocols are manifested in
implementation specific interpretations and extensions of protocol specifications. Given the
dated OS protocol stacks used to generate traffic, the IDEVAL data set might not
precisely reflect current network traffic. Additionally, the topology of the network,
shown in Figure A.1, is not representative of contemporary networks.


Figure A.1: IDEVAL 1999 Network Topology [286].


A.4 Data Files

The characteristics of the data files used are summarized in Table A.1 and
Table A.2. The characteristics of the merged data are summarized in Table A.3 and
Table A.4. The information was generated using the capinfos utility that accompanies
Wireshark. The capinfos utility reported errors for the following files:

wk5.day2.outside.pcap An error occurred after reading 2,558,481 packets. Less
data was read than expected.

wk5.day3.outside.pcap An error occurred after reading 1,385,130 packets. Less
data was read than expected.

wk5.day4.outside.pcap An error occurred after reading 2,308,273 packets. Less
data was read than expected.

wk5.day5.outside.pcap An error occurred after reading 2,651,589 packets. Less
data was read than expected.

Additionally, the file wk2.day4.inside.pcap was not included in the data set we
used.


A.4.1 Complete Data File Set.

Table A.1: IDEVAL Data Files Size - Size columns are measured in bytes. The Data Rate
is measured in bytes/second.

Name                          Packets     File Size       Data Size       Data Rate (bytes/s)   Avg Size
Week 1
wk1.day1.inside.pcap          1,492,331   341,027,537     317,150,217     4003.90               212.52
wk1.day1.outside.pcap         1,362,869   323,832,360     302,026,432     3813.47               221.61
wk1.day2.inside.pcap          1,237,119   341,401,548     321,607,620     4060.92               259.96
wk1.day2.ouside.pcap          1,157,328   325,395,277     306,878,005     3874.78               265.16
wk1.day3.inside.pcap          1,726,319   385,142,370     357,521,242     4514.33               207.10
wk1.day3.outside.pcap         1,616,713   368,776,477     342,909,045     4329.76               212.10
wk1.day4.inside.pcap          1,947,815   552,903,806     521,738,742     6588.39               267.86
wk1.day4.outside.pcap         1,807,060   517,042,040     488,129,056     6163.26               270.12
wk1.day5.inside.pcap          1,483,419   308,604,831     284,870,103     3596.96               192.04
wk1.day5.outside.pcap         1,349,635   284,774,805     263,180,621     3323.00               195.00
Week 2
wk2.day1.inside.pcap          1,753,377   401,046,958     372,992,902     4709.84               212.73
wk2.day1.outside.pcap         1,337,777   329,322,084     307,917,628     3885.50               230.17
wk2.day2.inside.pcap          1,585,120   400,104,805     374,742,861     5462.75               236.41
wk2.day2.outside.pcap         1,454,035   375,798,588     352,534,004     5154.17               242.45
wk2.day3.inside.pcap          1,011,149   169,156,383     152,977,975     1931.66               151.29
wk2.day3.outside.pcap         888,139     145,698,730     131,488,482     1660.24               148.05
wk2.day4.outside.pcap         1,412,645   330,867,682     308,285,665     3892.62               218.23
wk2.day5.inside.pcap          1,362,422   291,511,690     269,712,914     3405.61               197.97
wk2.day5.outside.pcap         1,252,412   273,295,370     253,256,754     3197.75               202.22
Week 3
wk3.day1.inside.extra.pcap    1,679,048   233,849,898     206,985,106     2709.36               123.28
wk3.day1.inside.pcap          2,106,744   468,024,334     434,316,406     5484.02               206.16
wk3.day1.outside.extra.pcap   1,191,358   150,014,497     130,952,745     1712.91               109.92
wk3.day1.outside.pcap         1,542,614   371,123,625     346,441,777     4374.34               224.58
wk3.day2.inside.extra.pcap    2,152,964   460,059,143     425,611,695     5387.50               197.69
wk3.day2.inside.pcap          1,831,648   414,885,615     385,579,223     4868.70               210.51
wk3.day2.outside.extra.pcap   1,822,764   403,648,042     374,483,794     4728.46               205.45
wk3.day2.outside.pcap         1,374,431   334,280,722     312,289,802     3943.11               227.21
wk3.day3.inside.pcap          1,849,753   558,991,635     529,395,563     6684.58               286.20
wk3.day3.ouside.pcap          1,760,859   540,109,859     511,936,091     6464.05               290.73
wk3.day3.outside.extra.pcap   2,453,966   766,843,295     727,579,815     9186.76               296.49
wk3.day4.inside.pcap          1,559,156   260,180,866     235,234,346     3235.69               150.87
wk3.day4.outside.pcap         1,096,660   183,158,763     165,612,179     2277.96               151.02
wk3.day5.inside.pcap          1,635,425   513,197,145     487,030,321     7939.98               297.80
Week 4
wk4.day1.inside.pcap          1,647,573   285,359,948     258,998,756     3270.39               157.20
wk4.day1.outside.pcap         1,279,543   216,724,852     196,252,140     2478.00               153.38
wk4.day2.outside.pcap         1,309,242   301,682,860     280,734,964     3544.68               214.43
wk4.day3.inside.pcap          1,766,074   399,300,104     371,042,896     4685.55               210.09
wk4.day3.outside.pcap         1,315,032   319,141,540     298,101,004     3764.00               226.69
wk4.day4.inside.pcap          2,356,503   519,183,790     481,479,718     6080.20               204.32
wk4.day4.outside.pcap         1,635,267   399,619,424     373,455,128     4715.46               228.38
wk4.day5.inside.pcap          1,945,538   368,018,512     336,889,880     4254.06               173.16
wk4.day5.outside.pcap         1,318,345   262,141,472     241,047,928     3043.60               182.84
Week 5
wk5.day1.inside.pcap          2,291,319   477,303,765     440,642,637     5564.14               192.31
wk5.day1.outside.pcap         1,376,598   344,257,810     322,232,218     4068.72               234.08
wk5.day2.inside.pcap          3,404,824   524,283,553     469,806,345     5932.00               137.98
wk5.day3.inside.pcap          2,087,942   491,350,468     457,943,372     5782.72               219.33
wk5.day4.inside.pcap          3,201,381   826,909,800     775,687,680     9794.98               242.30
wk5.day5.inside.pcap          3,393,918   1,093,706,789   1,039,404,077   13124.91              306.25
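
The columns of Table A.1 are related by simple arithmetic, which gives a quick consistency check on the capture files. The sketch below assumes the classic libpcap file layout (a 24-byte global header and a 16-byte header per packet record) and uses the first row of the table together with the duration of the same file from Table A.2. The function names are illustrative.

```python
PCAP_GLOBAL_HEADER = 24   # bytes, written once per capture file
PCAP_RECORD_HEADER = 16   # bytes, written before every captured packet


def expected_file_size(packets: int, data_size: int) -> int:
    """File size implied by the packet count and the captured data size."""
    return PCAP_GLOBAL_HEADER + PCAP_RECORD_HEADER * packets + data_size


def average_packet_size(packets: int, data_size: int) -> float:
    """Mean captured bytes per packet (the Avg Size column)."""
    return data_size / packets


def data_rate(data_size: int, duration_s: float) -> float:
    """Captured bytes per second over the capture duration."""
    return data_size / duration_s


# wk1.day1.inside.pcap, values taken from Table A.1 and Table A.2.
packets, file_size, data_size = 1_492_331, 341_027_537, 317_150_217
duration_s = 79210.265570

size_ok = expected_file_size(packets, data_size) == file_size
avg = round(average_packet_size(packets, data_size), 2)
rate = round(data_rate(data_size, duration_s), 2)
```

For this row the implied file size matches the reported one exactly, and the computed average size and data rate agree with the tabulated 212.52 and 4003.90. A file whose reported size disagrees with the implied one would point to a truncated capture of the kind capinfos reported for the Week 5 outside files.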

Table A.2: IDEVAL Data Files Time - Duration is in seconds. Start and End are specified
in Unix epoch time format. That is, they are specified in seconds since 00:00:00 UTC on
January 1, 1970.

Name                         Duration       Start Time                  Start       End Time                    End
Week 1
wk1.day1.inside.pcap         79210.265570   Mon Mar 1 08:00:05 1999     920293205   Tue Mar 2 06:00:16 1999     920372416
wk1.day1.outside.pcap        79199.855007   Mon Mar 1 08:00:02 1999     920293202   Tue Mar 2 06:00:02 1999     920372402
wk1.day2.inside.pcap         79195.839551   Tue Mar 2 08:00:01 1999     920379601   Wed Mar 3 05:59:57 1999     920458797
wk1.day2.ouside.pcap         79198.903380   Tue Mar 2 08:00:02 1999     920379602   Wed Mar 3 06:00:01 1999     920458801
wk1.day3.inside.pcap         79196.915752   Wed Mar 3 08:00:01 1999     920466001   Thu Mar 4 05:59:58 1999     920545198
wk1.day3.outside.pcap        79198.122434   Wed Mar 3 08:00:03 1999     920466003   Thu Mar 4 06:00:01 1999     920545201
wk1.day4.inside.pcap         79190.607508   Thu Mar 4 08:00:01 1999     920552401   Fri Mar 5 05:59:52 1999     920631592
wk1.day4.outside.pcap        79199.798770   Thu Mar 4 08:00:03 1999     920552403   Fri Mar 5 06:00:02 1999     920631602
wk1.day5.inside.pcap         79197.368381   Fri Mar 5 08:00:01 1999     920638801   Sat Mar 6 05:59:58 1999     920717998
wk1.day5.outside.pcap        79199.788688   Fri Mar 5 08:00:02 1999     920638802   Sat Mar 6 06:00:02 1999     920718002
Week 2
wk2.day1.inside.pcap         79194.416159   Mon Mar 8 08:00:00 1999     920898000   Tue Mar 9 05:59:54 1999     920977194
wk2.day1.outside.pcap        79247.843302   Mon Mar 8 08:00:01 1999     920898001   Tue Mar 9 06:00:49 1999     920977249
wk2.day2.inside.pcap         68599.660226   Tue Mar 9 08:00:01 1999     920984401   Wed Mar 10 03:03:21 1999    921053001
wk2.day2.outside.pcap        68397.777830   Tue Mar 9 08:00:01 1999     920984401   Wed Mar 10 02:59:59 1999    921052799
wk2.day3.inside.pcap         79194.939151   Wed Mar 10 08:00:02 1999    921070802   Thu Mar 11 05:59:57 1999    921149997
wk2.day3.outside.pcap        79198.605427   Wed Mar 10 08:00:03 1999    921070803   Thu Mar 11 06:00:01 1999    921150001
wk2.day4.outside.pcap        79197.414198   Thu Mar 11 08:00:03 1999    921157203   Fri Mar 12 06:00:00 1999    921236400
wk2.day5.inside.pcap         79196.548757   Fri Mar 12 08:00:01 1999    921243601   Sat Mar 13 05:59:58 1999    921322798
wk2.day5.outside.pcap        79198.411013   Fri Mar 12 08:00:02 1999    921243602   Sat Mar 13 06:00:00 1999    921322800
Week 3
wk3.day1.inside.extra.pcap   76396.316028   Mon Mar 22 08:00:02 1999    922107602   Tue Mar 23 05:13:19 1999    922183999

wk3. day  1. inside. pcap 

79196.697590 

Mon  Mar  15  08:00:01  1999 

921502801 

Tue  Mar  16  05:59:58  1999 

921581998 

wk3. day  1. outside. extra. pcap 

76450.306697 

Mon  Mar  22  08:00:03  1999 

922107603 

Tue  Mar  23  05:14:14  1999 

922184054 

wk3.  day  1.  outside,  pcap 

79198.682477 

Mon  Mar  15  08:00:02  1999 

921502802 

Tue  Mar  16  06:00:00  1999 

921582000 

wk3.day2.  inside,  extra,  pcap 

78999.815637 

Tue  Mar  23  08:00:02  1999 

922194002 

Wed  Mar  24  05:56:42  1999 

922273002 

wk3.day2.  inside,  pcap 

79195.474873 

Tue  Mar  16  08:00:00  1999 

921589200 

Wed  Mar  17  05:59:55  1999 

921668395 

wk3.day2.  outside,  extra,  pcap 

79197.824660 

Tue  Mar  23  08:00:00  1999 

922194000 

Wed  Mar  24  05:59:58  1999 

922273198 

wk3.day2.  outside,  pcap 

79198.800883 

Tue  Mar  16  08:00:01  1999 

921589201 

Wed  Mar  17  06:00:00  1999 

921668400 

wk3.day3.  inside,  pcap 

79196.540665 

Wed  Mar  17  08:00:01  1999 

921675601 

Thu  Mar  18  05:59:58  1999 

921754798 

wk3.day3.ouside.pcap 

79197.391311 

Wed  Mar  17  08:00:03  1999 

921675603 

Thu  Mar  18  06:00:00  1999 

921754800 

wk3.day3.  outside,  extra,  pcap 

79198.759438 

Wed  Mar  24  08:00:01  1999 

922280401 

Thu  Mar  25  06:00:00  1999 

922359600 

wk3.day4.  inside,  pcap 

72699.913792 

Thu  Mar  18  08:00:03  1999 

921762003 

Fri  Mar  19  04:11:42  1999 

921834702 

wk3.day4.  outside,  pcap 

72702.007896 

Thu  Mar  18  08:00:02  1999 

921762002 

Fri  Mar  19  04:11:44  1999 

921834704 

wk3.day5.  inside,  pcap 

61338.967582 

Fri  Mar  19  08:00:02  1999 

921848402 

Sat  Mar  20  01:02:21  1999 

921909741 

1  Week  4 

wk4.  day  1.  inside,  pcap 

79195.171194 

Mon  Mar  29  08:00:02  1999 

922712402 

Tue  Mar  30  05:59:57  1999 

922791597 

wk4.  day  1.  outside,  pcap 

79197.929511 

Mon  Mar  29  08:00:03  1999 

922712403 

Tue  Mar  30  06:00:01  1999 

922791601 

wk4.day2.  outside,  pcap 

79198.923458 

Tue  Mar  30  08:00:02  1999 

922798802 

Wed  Mar  31  06:00:01  1999 

922878001 

wk4.day3.  inside,  pcap 

79188.757556 

Wed  Mar  31  08:00:09  1999 

922885209 

Thu  Apr  1  05:59:57  1999 

922964397 

|  Continued  on  next  page  | 

97 


Table  A. 2  —  continued  from  previous  page 


Name 

Duration 

Start  Time 

Start 

End  Time 

End 

wk4.day3.  outside,  pcap 

79197.915230 

Wed  Mar  31  08:00:02  1999 

922885202 

Thu  Apr  1  06:00:00  1999 

922964400 

wk4.day4.  inside,  pcap 

79188.179768 

Thu  Apr  1  08:00:01  1999 

922971601 

Fri  Apr  2  05:59:49  1999 

923050789 

wk4.day4.  outside,  pcap 

79198.095408 

Thu  Apr  1  08:00:03  1999 

922971603 

Fri  Apr  2  06:00:01  1999 

923050801 

wk4.  day  5.  inside,  pcap 

79192.491900 

Fri  Apr  2  08:00:00  1999 

923058000 

Sat  Apr  3  05:59:53  1999 

923137193 

wk4.day5.  outside,  pcap 

79198.366773 

Fri  Apr  2  08:00:01  1999 

923058001 

Sat  Apr  3  06:00:00  1999 

923137200 

Week  5 

wk5.dayl. inside. pcap 

79193.374094 

Mon  Apr  5  08:00:02  1999 

923313602 

Tue  Apr  6  05:59:56  1999 

923392796 

wk5.  day  1.  outside,  pcap 

79197.375906 

Mon  Apr  5  08:00:03  1999 

923313603 

Tue  Apr  6  06:00:00  1999 

923392800 

wk5 .  day  2 .  inside  .pcap 

79198.608846 

Tue  Apr  6  08:00:00  1999 

923400000 

Wed  Apr  7  05:59:58  1999 

923479198 

wk5.day3.  inside,  pcap 

79191.650256 

Wed  Apr  7  08:00:00  1999 

923486400 

Thu  Apr  8  05:59:52  1999 

923565592 

wk5 .  day  4 .  inside  .pcap 

79192.349066 

Thu  Apr  8  08:00:00  1999 

923572800 

Fri  Apr  9  05:59:53  1999 

923651993 

wk5 .  day5 .  inside  .pcap 

79193.248408 

Fri  Apr  9  08:00:04  1999 

923659204 

Sat  Apr  10  05:59:58  1999 

923738398 
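
The relationship between the epoch columns and the human-readable columns of Table A.2 can be checked with a short sketch. It assumes the Start Time and End Time columns are printed in US Eastern Standard Time (UTC-5), which is what the March rows imply; the function name `describe_capture` is illustrative and not part of the original tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the human-readable columns are in EST (UTC-5); this matches the
# March rows of Table A.2 but is inferred from the data, not stated in the text.
EST = timezone(timedelta(hours=-5), "EST")


def describe_capture(start_epoch: int, end_epoch: int):
    """Return (start, end, whole-second duration) for one capture file."""
    start = datetime.fromtimestamp(start_epoch, tz=EST)
    end = datetime.fromtimestamp(end_epoch, tz=EST)
    # The table's Duration column carries sub-second precision, so the
    # whole-second difference computed here can differ from it by up to 1 s.
    return start, end, end_epoch - start_epoch


# First row of Table A.2 (wk1.day1.inside.pcap).
start, end, seconds = describe_capture(920293205, 920372416)
print(start.isoformat(), end.isoformat(), seconds)
```

For the first row this gives a start of 08:00:05 on March 1 and an end of 06:00:16 on March 2, with a whole-second duration of 79211, within one second of the 79210.265570 listed in the table.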

A.4.2 Merged Data File Set. The choice of offline analysis allowed us to filter the raw IDEVAL data set to include only the POP3 and SMTP protocol traffic. We used the command line interface to Wireshark, called tshark, to extract the protocols under consideration from each of the daily capture files. For example:


    tshark -R "tcp.port eq 25 or tcp.port eq 110" -r <input file name> -w <output file name>

extracts  SMTP  and  POP3  traffic  from  the  <input  file  name>  and  writes  it  to  the 
<output  file  name>. 

Next we used mergecap, a command line utility also distributed with Wireshark, to merge the filtered data into weekly summary files.


    mergecap -F libpcap -w wk1.pcap `find . -iname "wk1.*.filtered.pcap"`

Finally, we merged the weekly files into a cumulative trace file named total.pcap. The characteristics of the merged data are summarized in Table A.3 and Table A.4.


The pre-processing workflow is shown in Figure [AA]. Note that the cumulative alphabet is input into the weekly sample extractions to ensure that the sample strings produced use a common alphabet.
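
The filter-then-merge steps described above can be sketched as a small script that only builds the command lines. This is a minimal illustration, not the tooling used in this work: the helper names and the `.filtered.pcap` output naming are assumptions based on the `find` pattern shown earlier, and newer tshark releases may expect `-Y` (or `-2 -R`) in place of `-R`.

```python
SMTP_PORT = 25
POP3_PORT = 110


def filtered_name(capture: str) -> str:
    """Hypothetical naming rule: wk1.day1.inside.pcap -> wk1.day1.inside.filtered.pcap."""
    stem = capture[: -len(".pcap")] if capture.endswith(".pcap") else capture
    return stem + ".filtered.pcap"


def tshark_filter_cmd(in_path: str, out_path: str) -> list:
    """argv that keeps only SMTP and POP3 traffic, mirroring the tshark call in the text."""
    flt = f"tcp.port eq {SMTP_PORT} or tcp.port eq {POP3_PORT}"
    return ["tshark", "-R", flt, "-r", in_path, "-w", out_path]


def mergecap_cmd(out_path: str, in_paths: list) -> list:
    """argv that merges the filtered daily captures into one libpcap file."""
    return ["mergecap", "-F", "libpcap", "-w", out_path, *sorted(in_paths)]


daily = ["wk1.day1.inside.pcap", "wk1.day1.outside.pcap"]
filter_cmds = [tshark_filter_cmd(p, filtered_name(p)) for p in daily]
merge_cmd = mergecap_cmd("wk1.pcap", [filtered_name(p) for p in daily])
for cmd in filter_cmds + [merge_cmd]:
    print(" ".join(cmd))  # pass each list to subprocess.run(cmd, check=True) to execute
```

Building argument lists rather than shell strings avoids quoting problems with the filter expression and replaces the backtick `find` substitution with an explicit file list.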




Table A.3: Merged Data Files Size - Size columns are measured in bytes. The Data Rate is measured in bytes/second.

| Name | Packets | File Size | Data Size | Data Rate (bytes/s) | Avg Size |
|---|---|---|---|---|---|
| wk1.pcap | 619,215 | 107,546,006 | 97,638,542 | 230.02 | 157.68 |
| wk2.pcap | 640,110 | 107,405,071 | 97,163,287 | 228.85 | 151.79 |
| wk3.pcap | 899,064 | 156,829,234 | 142,444,186 | 170.55 | 158.44 |
| wk4.pcap | 708,272 | 124,405,656 | 113,073,280 | 266.94 | 159.65 |
| wk5.pcap | 897,033 | 150,206,030 | 135,853,478 | 328.73 | 151.45 |

Table A.4: Merged Data Files Time - Duration is in seconds. Start and End are specified in Unix epoch time format. That is, they are specified in seconds since 00:00:00 UTC on January 1, 1970.

| Name | Duration | Start Time | Start | End Time | End |
|---|---|---|---|---|---|
| wk1.pcap | 424481.648063 | Mon Mar 1 08:00:40 1999 | 920293240 | Sat Mar 6 05:55:22 1999 | 920717722 |
| wk2.pcap | 424565.061365 | Mon Mar 8 08:00:04 1999 | 920898004 | Sat Mar 13 05:56:09 1999 | 921322569 |
| wk3.pcap | 835181.388736 | Mon Mar 15 08:00:18 1999 | 921502818 | Wed Mar 24 23:59:59 1999 | 922337999 |
| wk4.pcap | 423583.955343 | Mon Mar 29 08:00:02 1999 | 922712402 | Sat Apr 3 05:39:46 1999 | 923135986 |
| wk5.pcap | 413264.972580 | Mon Apr 5 08:00:31 1999 | 923313631 | Sat Apr 10 02:48:16 1999 | 923726896 |
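
The derived columns of Table A.3 can be reproduced from the raw counts together with the Duration column of Table A.4. The sketch below is a consistency check written for this appendix, not the tool that produced the tables; the function name is illustrative. It also relies on the classic libpcap layout (a 24-byte global header plus a 16-byte record header per packet), which appears to account for the difference between File Size and Data Size in the rows checked.

```python
PCAP_GLOBAL_HEADER = 24  # bytes, classic libpcap file header
PCAP_RECORD_HEADER = 16  # bytes, per-packet record header


def derived_columns(packets: int, file_size: int, data_size: int, duration: float) -> dict:
    """Recompute the derived columns for one merged capture file."""
    return {
        "avg_size": round(data_size / packets, 2),    # bytes per packet
        "data_rate": round(data_size / duration, 2),  # bytes per second
        "overhead": file_size - data_size,            # bytes not counted as packet data
        "expected_overhead": PCAP_GLOBAL_HEADER + PCAP_RECORD_HEADER * packets,
    }


# wk1.pcap: counts from Table A.3, duration from Table A.4.
wk1 = derived_columns(619_215, 107_546_006, 97_638_542, 424481.648063)
print(wk1)
```

For wk1.pcap this yields an average size of 157.68 and a data rate of 230.02, matching the table, and the file-format overhead equals the expected 9,907,464 bytes.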



Table A.5: SMTP TCP Connection Summary - Note that, as with POP3, the weekly pcap files do not contain the entire TCP connection for several of the SMTP connections. The Total column is NOT the sum of the row.

| TCP Operation | Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Total | Percent |
|---|---|---|---|---|---|---|---|
| TCPopen | 19,424 | 22,868 | 32,180 | 25,155 | 27,783 | 126,545 | 100 |
| TCPclose | 18,687 | 22,172 | 31,212 | 24,398 | 23,351 | 119,791 | 25.25 |
| TCPreset | 5 | 9 | 60 | 253 | 3,730 | 3,957 | 73.81 |
| TCPtimeout | 0 | 0 | 146 | 0 | 424 | 2,519 | 0.07 |
| NIDSexit | 732 | 687 | 762 | 504 | 278 | 278 | 0.85 |

Total termination conditions: 126,545

A.5 SMTP Sample Data


Table A.7: Data Summary: SMTP Command Alphabet total.pcap

| Type | Count | Percent | Label | Sample |
|---|---|---|---|---|
| 1 | 1 | 0.000790233 | + | TCPopen EHLO MAIL RCPT DATA . TCPclose |
| 2 | 1 | 0.000790233 | + | TCPopen EHLO MAIL RCPT(31) QUIT TCPclose |
| 3 | 1 | 0.000790233 | + | TCPopen HELO MAIL RCPT QUIT TCPclose |
| 4 | 1 | 0.000790233 | + | TCPopen HELO MAIL RCPT(6) DATA . QUIT TCPclose |
| 5 | 1 | 0.000790233 | + | TCPopen HELO MAIL RCPT(7) DATA . QUIT TCPclose |
| 6 | 1 | 0.000790233 | - | TCPopen EHLO HELO MAIL RCPT TCPtimeout |
| 7 | 1 | 0.000790233 | - | TCPopen EHLO HELO MAIL RCPT(11) DATA TCPtimeout |
| 8 | 1 | 0.000790233 | - | TCPopen EHLO HELO MAIL RCPT(2) DATA . NIDSexit |
| 9 | 1 | 0.000790233 | - | TCPopen EHLO HELO MAIL RCPT(20) TCPtimeout |
| 10 | 1 | 0.000790233 | - | TCPopen EHLO HELO TCPtimeout |
| 11 | 1 | 0.000790233 | - | TCPopen EHLO MAIL RCPT DATA . TCPreset |
| 12 | 1 | 0.000790233 | - | TCPopen EHLO MAIL RCPT DATA . TCPtimeout |
| 13 | 1 | 0.000790233 | - | TCPopen EHLO MAIL RCPT DATA NIDSexit |
| 14 | 1 | 0.000790233 | - | TCPopen EHLO MAIL TCPtimeout |
| 15 | 1 | 0.000790233 | - | TCPopen EHLO TCPtimeout |
| 16 | 1 | 0.000790233 | - | TCPopen HELO MAIL RCPT DATA . QUIT NIDSexit |
| 17 | 1 | 0.000790233 | - | TCPopen HELO MAIL TCPtimeout |
| 18 | 1 | 0.000790233 | - | TCPopen HELO TCPtimeout |
| 19 | 2 | 0.00158047 | + | TCPopen EHLO HELO MAIL RCPT(7) DATA . QUIT TCPclose |
| 20 | 2 | 0.00158047 | + | TCPopen EHLO MAIL RCPT(2) QUIT TCPclose |
| 21 | 2 | 0.00158047 | + | TCPopen EHLO MAIL RCPT(4) QUIT TCPclose |
| 22 | 2 | 0.00158047 | - | TCPopen EHLO HELO MAIL RCPT DATA . QUIT TCPtimeout |
| 23 | 2 | 0.00158047 | - | TCPopen EHLO HELO MAIL RCPT DATA TCPreset |
| 24 | 2 | 0.00158047 | - | TCPopen EHLO HELO MAIL RCPT(5) DATA TCPclose |
| 25 | 2 | 0.00158047 | - | TCPopen EHLO MAIL RCPT(7) DATA TCPclose |
| 26 | 2 | 0.00158047 | - | TCPopen HELO MAIL RCPT DATA . QUIT TCPtimeout |
| 27 | 2 | 0.00158047 | - | TCPopen mail TCPclose |
| 28 | 3 | 0.0023707 | + | TCPopen EHLO HELO MAIL RCPT DATA . RSET MAIL RCPT DATA . RSET MAIL RCPT DATA . RSET MAIL RCPT DATA . RSET MAIL RCPT DATA . RSET MAIL RCPT DATA . QUIT TCPclose |
| 29 | 3 | 0.0023707 | - | TCPopen EHLO HELO MAIL RCPT DATA NIDSexit |
| 30 | 3 | 0.0023707 | - | TCPopen EHLO HELO MAIL RCPT(2) DATA . TCPtimeout |
| 31 | 3 | 0.0023707 | - | TCPopen EHLO HELO MAIL RCPT(30) DATA . TCPtimeout |
| 32 | 3 | 0.0023707 | - | TCPopen HELP TCPclose |
| 33 | 4 | 0.00316093 | + | TCPopen EHLO MAIL RCPT(3) QUIT TCPclose |
| 34 | 4 | 0.00316093 | + | TCPopen HELO MAIL RCPT(4) DATA . QUIT TCPclose |
| 35 | 4 | 0.00316093 | - | TCPopen EHLO HELO MAIL RCPT(3) DATA TCPclose |
| 36 | 4 | 0.00316093 | - | TCPopen EHLO MAIL RCPT DATA TCPclose |
| 37 | 5 | 0.00395116 | + | TCPopen HELO MAIL RCPT(5) DATA . QUIT TCPclose |
| 38 | 6 | 0.0047414 | + | TCPopen EHLO MAIL RCPT(30) QUIT TCPclose |
| 39 | 6 | 0.0047414 | - | TCPopen EHLO HELO MAIL RCPT DATA . NIDSexit |
| 40 | 6 | 0.0047414 | - | TCPopen EHLO HELO MAIL RCPT(30) DATA TCPtimeout |
| 41 | 6 | 0.0047414 | - | TCPopen QUIT TCPclose |
| 42 | 7 | 0.00553163 | + | TCPopen EHLO MAIL RCPT RSET QUIT TCPclose |
| 43 | 10 | 0.00790233 | - | TCPopen EHLO HELO MAIL RCPT(2) DATA TCPclose |
| 44 | 20 | 0.0158047 | + | TCPopen EHLO HELO MAIL RCPT(31) DATA . QUIT TCPclose |
| 45 | 21 | 0.0165949 | - | TCPopen EHLO HELO MAIL RCPT DATA . QUIT TCPreset |
| 46 | 23 | 0.0181754 | - | TCPopen EHLO HELO MAIL RCPT DATA TCPtimeout |
| 47 | 26 | 0.0205461 | - | TCPopen TCPtimeout |
| 48 | 33 | 0.0260777 | - | TCPopen NIDSexit |
| 49 | 36 | 0.0284484 | + | TCPopen HELO MAIL RCPT(3) DATA . QUIT TCPclose |
| 50 | 57 | 0.0450433 | + | TCPopen EHLO MAIL RCPT(4) DATA . QUIT TCPclose |
| 51 | 58 | 0.0458335 | - | TCPopen EHLO HELO MAIL RCPT DATA . TCPreset |
| 52 | 64 | 0.0505749 | - | TCPopen EHLO HELO MAIL RCPT DATA TCPclose |
| 53 | 65 | 0.0513651 | + | TCPopen EHLO MAIL RCPT(3) DATA . QUIT TCPclose |
| 54 | 93 | 0.0734916 | + | TCPopen EHLO MAIL RCPT(7) DATA . QUIT TCPclose |
| 55 | 103 | 0.081394 | + | TCPopen EHLO MAIL RCPT QUIT TCPclose |
| 56 | 112 | 0.0885061 | + | TCPopen EHLO HELO MAIL RCPT(6) DATA . QUIT TCPclose |
| 57 | 185 | 0.146193 | + | TCPopen EHLO HELO MAIL RCPT(5) DATA . QUIT TCPclose |
| 58 | 189 | 0.149354 | + | TCPopen EHLO HELO MAIL RCPT(11) DATA . QUIT TCPclose |
| 59 | 190 | 0.150144 | + | TCPopen EHLO MAIL RCPT(10) DATA . QUIT TCPclose |
| 60 | 199 | 0.157256 | + | TCPopen EHLO MAIL RCPT(24) DATA . QUIT TCPclose |
| 61 | 212 | 0.167529 | + | TCPopen HELO MAIL RCPT(2) DATA . QUIT TCPclose |
| 62 | 233 | 0.184124 | - | TCPopen HELO MAIL RCPT DATA . NIDSexit |
| 63 | 242 | 0.191236 | - | TCPopen HELO MAIL RCPT DATA . TCPreset |
| 64 | 433 | 0.342171 | + | TCPopen EHLO MAIL RCPT(2) DATA . QUIT TCPclose |
| 65 | 449 | 0.354814 | - | TCPopen EHLO HELO MAIL RCPT DATA . TCPtimeout |
| 66 | 560 | 0.44253 | + | TCPopen EHLO HELO MAIL RCPT(4) DATA . QUIT TCPclose |
| 67 | 1494 | 1.18061 | + | TCPopen EHLO HELO MAIL RCPT(30) DATA . QUIT TCPclose |
| 68 | 1670 | 1.31969 | - | TCPopen mail rcpt TCPclose |
| 69 | 1933 | 1.52752 | + | TCPopen EHLO HELO MAIL RCPT(3) DATA . QUIT TCPclose |
| 70 | 1996 | 1.5773 | - | TCPopen HELO MAIL RCPT DATA . TCPtimeout |
| 71 | 3150 | 2.48923 | + | TCPopen HELO MAIL RCPT DATA . QUIT TCPclose |
| 72 | 3155 | 2.49318 | - | TCPopen TCPclose |
| 73 | 3633 | 2.87092 | - | TCPopen TCPreset |
| 74 | 3887 | 3.07163 | + | TCPopen EHLO MAIL RCPT DATA . QUIT TCPclose |
| 75 | 7389 | 5.83903 | + | TCPopen EHLO HELO MAIL RCPT(2) DATA . QUIT TCPclose |
| 76 | 94522 | 74.6944 | + | TCPopen EHLO HELO MAIL RCPT DATA . QUIT TCPclose |

76 Start with: TCPopen
45 End with: TCPclose (119,791 samples)
6 End with: TCPreset (3,957 samples)
18 End with: TCPtimeout (2,519 samples)
7 End with: NIDSexit (278 samples)
34 Positive sample Types. 114,869 Positive samples.
42 Negative sample Types. 11,676 Negative samples.
126,545 Total Samples
76 Unique Types
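
Two conventions in these summary tables can be made concrete with a short sketch: the run-length shorthand in the Sample column and the Percent column. The sketch assumes that a token such as `RCPT(31)` stands for 31 consecutive occurrences of `RCPT` (consistent with the expanded strings in the composite-alphabet table) and that Percent is the count divided by the 126,545 total samples, printed to six significant digits. The function names are illustrative.

```python
import re

TOTAL_SAMPLES = 126_545  # "Total Samples" from the table summary

# Matches run-length tokens such as "RCPT(31)" or "250(3)".
_RUN = re.compile(r"^(.+)\((\d+)\)$")


def expand(sample: str) -> list:
    """Expand run-length shorthand into the full symbol sequence."""
    tokens = []
    for tok in sample.split():
        match = _RUN.match(tok)
        if match:
            tokens.extend([match.group(1)] * int(match.group(2)))
        else:
            tokens.append(tok)
    return tokens


def percent(count: int, total: int = TOTAL_SAMPLES) -> str:
    """Share of all samples, formatted like the Percent column."""
    return format(100 * count / total, ".6g")


print(expand("TCPopen EHLO MAIL RCPT(3) QUIT TCPclose"))
print(percent(1), percent(94522))
```

Expanding the shorthand first keeps the symbol alphabet small (there is no separate `RCPT(31)` symbol), which matters when the strings are later fed to an automaton-learning step.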
Table A.8: Data Summary: SMTP Reply Alphabet total.pcap

| Type | Count | Percent | Label | Sample |
|---|---|---|---|---|
| 1 | 1 | 0.000790233 | + | TCPopen 220 250(2) 551(31) 221 TCPclose |
| 2 | 1 | 0.000790233 | + | TCPopen 220 250(3) 354 250 TCPclose |
| 3 | 1 | 0.000790233 | + | TCPopen 220 250(8) 354 250 221 TCPclose |
| 4 | 1 | 0.000790233 | - | TCPopen 220 250(3) 354 250 221 NIDSexit |
| 5 | 1 | 0.000790233 | - | TCPopen 220 250(3) 354 250 TCPreset |
| 6 | 1 | 0.000790233 | - | TCPopen 220 250(3) 354 552 NIDSexit |
| 7 | 1 | 0.000790233 | - | TCPopen 220 250(3) 354 552 TCPtimeout |
| 8 | 1 | 0.000790233 | - | TCPopen 220 250(3) 503 500 221 TCPclose |
| 9 | 1 | 0.000790233 | - | TCPopen 220 500 250 TCPtimeout |
| 10 | 1 | 0.000790233 | - | TCPopen 220 500 250(13) 354 250 221 TCPtimeout |
| 11 | 1 | 0.000790233 | - | TCPopen 220 500 250(2) TCPtimeout |
| 12 | 1 | 0.000790233 | - | TCPopen 220 500 250(21) TCPtimeout |
| 13 | 1 | 0.000790233 | - | TCPopen 220 500 250(3) 354 250(2) NIDSexit |
| 14 | 1 | 0.000790233 | - | TCPopen 220 500 250(3) TCPtimeout |
| 15 | 1 | 0.000790233 | - | TCPopen 220 500 250(32) 354 TCPtimeout |
| 16 | 1 | 0.000790233 | - | TCPopen 220 500 250(4) 354 NIDSexit |
| 17 | 2 | 0.00158047 | + | TCPopen 220 250(2) 551(2) 221 TCPclose |
| 18 | 2 | 0.00158047 | + | TCPopen 220 250(2) 551(4) 221 TCPclose |
| 19 | 2 | 0.00158047 | + | TCPopen 220 500 250(9) 354 250 221 TCPclose |
| 20 | 2 | 0.00158047 | - | TCPopen 220 250 TCPtimeout |
| 21 | 2 | 0.00158047 | - | TCPopen 220 250(2) 503 500 221 TCPclose |
| 22 | 2 | 0.00158047 | - | TCPopen 220 250(3) 354 250 221 TCPtimeout |
| 23 | 2 | 0.00158047 | - | TCPopen 220 451 TCPreset |
| 24 | 2 | 0.00158047 | - | TCPopen 220 500 250(3) 354 250 221 NIDSexit |
| 25 | 3 | 0.0023707 | + | TCPopen 220 500 250(3) 354 250(4) 354 250(4) 354 250(4) 354 250(4) 354 250(4) 354 250 221 TCPclose |
| 26 | 3 | 0.0023707 | - | TCPopen 220 500 250(3) 354 250 TCPtimeout |
| 27 | 3 | 0.0023707 | - | TCPopen 220 500 250(4) 354 250 221 TCPtimeout |
| 28 | 4 | 0.00316093 | + | TCPopen 220 250(2) 551(3) 221 TCPclose |
| 29 | 5 | 0.00395116 | + | TCPopen 220 250(7) 354 250 221 TCPclose |
| 30 | 6 | 0.0047414 | + | TCPopen 220 250(2) 551(30) 221 TCPclose |
| 31 | 6 | 0.0047414 | - | TCPopen 220 221 TCPclose |
| 32 | 6 | 0.0047414 | - | TCPopen 220 500 250(3) 354 NIDSexit |
| 33 | 6 | 0.0047414 | - | TCPopen 220 TCPclose |
| 34 | 7 | 0.00553163 | + | TCPopen 220 250(2) 551 250 221 TCPclose |
| 35 | 8 | 0.00632186 | - | TCPopen 220 500 250(32) 354 250 221 TCPtimeout |
| 36 | 10 | 0.00790233 | - | TCPopen 220 TCPtimeout |
| 37 | 11 | 0.00869256 | - | TCPopen 220 500 250(3) 354 250 221 TCPtimeout |
| 38 | 18 | 0.0142242 | - | TCPopen TCPtimeout |
| 39 | 20 | 0.0158047 | + | TCPopen 220 500 250(33) 354 250 221 TCPclose |
| 40 | 22 | 0.0173851 | - | TCPopen 220 500 250(3) 354 250 TCPreset |
| 41 | 25 | 0.0197558 | - | TCPopen 220 250(2) 354 250 TCPclose |
| 42 | 33 | 0.0260777 | + | TCPopen 220 250(3) 354 250(2) TCPclose |
| 43 | 33 | 0.0260777 | - | TCPopen NIDSexit |
| 44 | 37 | 0.0292386 | - | TCPopen 220 451 421 TCPreset |
| 45 | 59 | 0.0466237 | - | TCPopen 220 500 250(3) 354 TCPreset |
| 46 | 61 | 0.0482042 | + | TCPopen 220 250(6) 354 250 221 TCPclose |
| 47 | 75 | 0.0592675 | - | TCPopen TCPreset |
| 48 | 96 | 0.0758623 | + | TCPopen 220 250(9) 354 250 221 TCPclose |
| 49 | 101 | 0.0798135 | + | TCPopen 220 250(5) 354 250 221 TCPclose |
| 50 | 104 | 0.0821842 | + | TCPopen 220 250(2) 551 221 TCPclose |
| 51 | 112 | 0.0885061 | + | TCPopen 220 500 250(8) 354 250 221 TCPclose |
| 52 | 153 | 0.120906 | - | TCPopen TCPclose |
| 53 | 170 | 0.13434 | - | TCPopen 220 451 421 TCPclose |
| 54 | 187 | 0.147774 | + | TCPopen 220 500 250(7) 354 250 221 TCPclose |
| 55 | 189 | 0.149354 | + | TCPopen 220 500 250(13) 354 250 221 TCPclose |
| 56 | 190 | 0.150144 | + | TCPopen 220 250(12) 354 250 221 TCPclose |
| 57 | 199 | 0.157256 | + | TCPopen 220 250(26) 354 250 221 TCPclose |
| 58 | 209 | 0.165159 | - | TCPopen 220 421 TCPreset |
| 59 | 233 | 0.184124 | - | TCPopen 220 250(3) 354 NIDSexit |
| 60 | 242 | 0.191236 | - | TCPopen 220 250(3) 354 TCPreset |
| 61 | 459 | 0.362717 | - | TCPopen 220 500 250(3) 354 TCPtimeout |
| 62 | 560 | 0.44253 | + | TCPopen 220 500 250(6) 354 250 221 TCPclose |
| 63 | 645 | 0.5097 | + | TCPopen 220 250(4) 354 250 221 TCPclose |
| 64 | 1494 | 1.18061 | + | TCPopen 220 500 250(32) 354 250 221 TCPclose |
| 65 | 1937 | 1.53068 | + | TCPopen 220 500 250(5) 354 250 221 TCPclose |
| 66 | 1996 | 1.5773 | - | TCPopen 220 250(3) 354 TCPtimeout |
| 67 | 3310 | 2.61567 | - | TCPopen 220 TCPreset |
| 68 | 4473 | 3.53471 | - | TCPopen 220 250(2) 354 250 221 TCPclose |
| 69 | 7008 | 5.53795 | + | TCPopen 220 250(3) 354 250 221 TCPclose |
| 70 | 7399 | 5.84693 | + | TCPopen 220 500 250(4) 354 250 221 TCPclose |
| 71 | 14633 | 11.5635 | + | TCPopen 220 500 250(3) 354 250(2) TCPclose |
| 72 | 79953 | 63.1815 | + | TCPopen 220 500 250(3) 354 250 221 TCPclose |

72 Start with: TCPopen
38 End with: TCPclose (119,791 samples)
9 End with: TCPreset (3,957 samples)
17 End with: TCPtimeout (2,519 samples)
8 End with: NIDSexit (278 samples)
30 Positive sample Types. 114,955 Positive samples.
42 Negative sample Types. 11,590 Negative samples.
126,545 Total Samples
72 Unique Types


Table  A. 9:  Data  Summary:  SMTP  Composite  Alphabet  total. pcap 


Data  Summary:  SMTP  Composite  Alphabet  total. pcap  j 

Type 

Count 

Percent 

Label 

Sample 

1 

1 

0.000790233 

+ 

TCPopen  220  EHLO  250  MAIL  250  RCPT  250  DATA  354  .  250  TCPclose 

2 

1 

0.000790233 

+ 

TCPopen  220  EHLO  250  MAIL  250  RCPT  551  RCPT  551  RCPT  551  RCPT 

551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT 

551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT 

551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT 

551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  QUIT 

221  TCPclose 

3 

1 

0.000790233 

+ 

TCPopen  220  HELO  250  MAIL  250  RCPT  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  DATA  354  .  250  QUIT  221  TCPclose 

4 

1 

0.000790233 

+ 

TCPopen  220  HELO  250  MAIL  250  RCPT  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  DATA  354  .  250  QUIT  221  TCPclose 

5 

1 

0.000790233 

+ 

TCPopen  220  HELO  250  MAIL  250  RCPT  551  QUIT  221  TCPclose 

6 

1 

0.000790233 

- 

TCPopen  220  250(3)  503  HELP  500  221  TCPclose 
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Table  A. 9  —  continued  from  previous  page 


Type 

Count 

Percent 

Label 

Sample 

7 

1 

0.000790233 

- 

TCPopen  220  500  250(3)  354  250  221  TCPtimeout 

8 

1 

0.000790233 

" 

TCPopen  220  EHLO  250  MAIL  250  RCPT  250  DATA  354  .  250  221  TCPtime¬ 
out 

9 

1 

0.000790233 

- 

TCPopen  220  EHLO  250  MAIL  250  RCPT  250  DATA  354  .  250  TCPreset 

10 

1 

0.000790233 

- 

TCPopen  220  EHLO  250  MAIL  250  RCPT  250  DATA  354  250  221  NIDSexit 

11 

1 

0.000790233 

- 

TCPopen  220  EHLO  250  MAIL  TCPtimeout 

12 

1 

0.000790233 

" 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  DATA  354  .  250 

QUIT  TCPtimeout 

13 

1 

0.000790233 

" 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  DATA  354  .  250 

TCPreset 

14 

1 

0.000790233 

" 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  DATA  354  250(2) 

NIDSexit 

15 

1 

0.000790233 

- 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  DATA  TCPtimeout 

16 

1 

0.000790233 

" 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  RCPT  250  DATA 

354  .  NIDSexit 

17 

1 

0.000790233 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  DATA  354  250  221  TCPtimeout 

18 

1 

0.000790233 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  DATA 

354  .  TCPtimeout 

19 

1 

0.000790233 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  TCPtimeout 

20 

1 

0.000790233 

- 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  TCPtimeout 

21 

1 

0.000790233 

- 

TCPopen  220  EHLO  500  HELO  250  TCPtimeout 

22 

1 

0.000790233 

- 

TCPopen  220  EHLO  HELO  MAIL  RCPT  DATA  .  QUIT  TCPtimeout 

23 

1 

0.000790233 

- 

TCPopen  220  EHLO  TCPtimeout 

24 

1 

0.000790233 

" 

TCPopen  220  HELO  250  MAIL  250  RCPT  250  DATA  354  .  250  QUIT  221 

TCPtimeout 

25 

1 

0.000790233 

- 

TCPopen  220  HELO  250  MAIL  250  RCPT  250  DATA  354  .  552  QUIT  NIDSexit 

26 

1 

0.000790233 

" 

TCPopen  220  HELO  250  MAIL  250  RCPT  250  DATA  354  552  .  QUIT  TCPti¬ 
meout 

27 

1 

0.000790233 

- 

TCPopen  220  HELO  250  MAIL  TCPtimeout 

28 

1 

0.000790233 

- 

TCPopen  220  HELO  TCPtimeout 

29 

2 

0.00158047 

+ 

TCPopen  220  EHLO  250  MAIL  250  RCPT  551  RCPT  551  QUIT  221  TCPclose 

30 

2 

0.00158047 

+ 

TCPopen  220  EHLO  250  MAIL  250  RCPT  551  RCPT  551  RCPT  551  RCPT 

551  QUIT  221  TCPclose 

31 

2 

0.00158047 

+ 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  DATA  354  .  250  QUIT  221 

TCPclose 

32 

2 

0.00158047 

- 

TCPopen  220  250(2)  503  HELP  500  221  TCPclose 

33 

2 

0.00158047 

- 

TCPopen  220  451  TCPreset 

34 

2 

0.00158047 

" 

TCPopen  220  EHLO  250  MAIL  250  RCPT  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  DATA  354  250  221  TCPclose 

35 

2 

0.00158047 

" 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  DATA  354  .  250 

TCPtimeout 

36 

2 

0.00158047 

" 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  DATA  354  250  221 

NIDSexit 

37 

2 

0.00158047 

- 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  DATA  354  TCPreset 
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Table  A. 9  —  continued  from  previous  page 


Type 

Count 

Percent 

Label 

Sample 

38 

2 

0.00158047 

" 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  DATA  354  250  221  TCPclose 

39 

2 

0.00158047 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  DATA 

354  .  250  221  TCPtimeout 

40 

2 

0.00158047 

- 

TCPopen  220  mail  250(2)  354  250  221  TCPclose 

41 

3 

0.0023707 

+ 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  DATA  354  .  250 

RSET  250  MAIL  250  RCPT  250  DATA  354  .  250  RSET  250  MAIL  250  RCPT 

250  DATA  354  .  250  RSET  250  MAIL  250  RCPT  250  DATA  354  .  250  RSET 

250  MAIL  250  RCPT  250  DATA  354  .  250  RSET  250  MAIL  250  RCPT  250 

DATA  354  .  250  QUIT  221  TCPclose 

42 

3 

0.0023707 

" 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  RCPT  250  DATA 

354  .  250  221  TCPtimeout 

43 

4 

0.00316093 

+ 

TCPopen  220  EHLO  250  MAIL  250  RCPT  551  RCPT  551  RCPT  551  QUIT 

221  TCPclose 

44 

4 

0.00316093 

+ 

TCPopen  220  HELO  250  MAIL  250  RCPT  250  RCPT  250  RCPT  250  RCPT 

250  DATA  354  .  250  QUIT  221  TCPclose 

45 

4 

0.00316093 

- 

TCPopen  220  EHLO  250  MAIL  250  RCPT  250  DATA  354  250  221  TCPclose 

46 

4 

0.00316093 

" 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  DATA  354  .  250  221 

TCPtimeout 

47 

4 

0.00316093 

" 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  RCPT  250  RCPT 

250  DATA  354  250  221  TCPclose 

48 

5 

0.00395116 

+ 

TCPopen  220  HELO  250  MAIL  250  RCPT  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  DATA  354  .  250  QUIT  221  TCPclose 

49 

6 

0.0047414 

+ 

TCPopen  220  EHLO  250  MAIL  250  RCPT  551  RCPT  551  RCPT  551  RCPT  551 

RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551 

RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551 

RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551 

RCPT  551  RCPT  551  RCPT  551  RCPT  551  RCPT  551  QUIT  221  TCPclose 

50 

6 

0.0047414 

- 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  DATA  354  .  NIDSexit 

51 

6 

0.0047414 

" 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  DATA  354  250  221 

TCPtimeout 

52 

6 

0.0047414 

TCPopen  220  EHLO  500  HELO  250  MAIL  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT 

250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  RCPT  250  DATA 

354  250  221  TCPtimeout 

Type  Count  Percent     Label  Sample
53    6      0.0047414   -      TCPopen 220 TCPclose
54    6      0.0047414   -      TCPopen QUIT 220 221 TCPclose
55    7      0.00553163  +      TCPopen 220 EHLO 250 MAIL 250 RCPT 551 RSET 250 QUIT 221 TCPclose
56    7      0.00553163  -      TCPopen 220 TCPtimeout
57    10     0.00790233  -      TCPopen 220 EHLO 500 HELO 250 MAIL 250 RCPT 250 RCPT 250 DATA
                                354 250 221 TCPclose
58    12     0.00948279  -      TCPopen 220 250(2) 354 250 TCPclose
59    13     0.010273    -      TCPopen 220 mail 250 rcpt 250 354 250 TCPclose
60    16     0.0126437   -      TCPopen 220 EHLO 500 HELO 250 MAIL 250 RCPT 250 DATA 354 TCPtimeout
61    18     0.0142242   -      TCPopen TCPtimeout
62    20     0.0158047   +      TCPopen 220 EHLO 500 HELO 250 MAIL 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT
                                250 DATA 354 . 250 QUIT 221 TCPclose
63    21     0.0165949   -      TCPopen 220 EHLO 500 HELO 250 MAIL 250 RCPT 250 DATA 354 . 250
                                QUIT TCPreset
64    33     0.0260777   +      TCPopen 220 HELO 250 MAIL 250 RCPT 250 DATA 354 . 250 QUIT 250
                                TCPclose
65    33     0.0260777   -      TCPopen NIDSexit
66    36     0.0284484   +      TCPopen 220 HELO 250 MAIL 250 RCPT 250 RCPT 250 RCPT 250 DATA
                                354 . 250 QUIT 221 TCPclose
67    37     0.0292386   -      TCPopen 220 451 421 TCPreset
68    57     0.0450433   +      TCPopen 220 EHLO 250 MAIL 250 RCPT 250 RCPT 250 RCPT 250 RCPT
                                250 DATA 354 . 250 QUIT 221 TCPclose
69    57     0.0450433   -      TCPopen 220 EHLO 500 HELO 250 MAIL 250 RCPT 250 DATA 354 . TCPreset
70    64     0.0505749   -      TCPopen 220 EHLO 500 HELO 250 MAIL 250 RCPT 250 DATA 354 250 221
                                TCPclose
71    65     0.0513651   +      TCPopen 220 EHLO 250 MAIL 250 RCPT 250 RCPT 250 RCPT 250 DATA
                                354 . 250 QUIT 221 TCPclose
72    75     0.0592675   -      TCPopen TCPreset
73    93     0.0734916   +      TCPopen 220 EHLO 250 MAIL 250 RCPT 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 RCPT 250 RCPT 250 DATA 354 . 250 QUIT 221 TCPclose
74    103    0.081394    +      TCPopen 220 EHLO 250 MAIL 250 RCPT 551 QUIT 221 TCPclose
75    112    0.0885061   +      TCPopen 220 EHLO 500 HELO 250 MAIL 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 RCPT 250 RCPT 250 DATA 354 . 250 QUIT 221 TCPclose
76    153    0.120906    -      TCPopen TCPclose
77    170    0.13434     -      TCPopen 220 451 421 TCPclose
78    185    0.146193    +      TCPopen 220 EHLO 500 HELO 250 MAIL 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 RCPT 250 DATA 354 . 250 QUIT 221 TCPclose
79    189    0.149354    +      TCPopen 220 EHLO 500 HELO 250 MAIL 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 DATA 354 . 250 QUIT 221 TCPclose
80    190    0.150144    +      TCPopen 220 EHLO 250 MAIL 250 RCPT 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 DATA
                                354 . 250 QUIT 221 TCPclose
81    199    0.157256    +      TCPopen 220 EHLO 250 MAIL 250 RCPT 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 DATA
                                354 . 250 QUIT 221 TCPclose
82    209    0.165159    -      TCPopen 220 421 TCPreset
83    212    0.167529    +      TCPopen 220 HELO 250 MAIL 250 RCPT 250 RCPT 250 DATA 354 . 250
                                QUIT 221 TCPclose
84    233    0.184124    -      TCPopen 220 HELO 250 MAIL 250 RCPT 250 DATA 354 . NIDSexit
85    242    0.191236    -      TCPopen 220 HELO 250 MAIL 250 RCPT 250 DATA 354 . TCPreset
86    433    0.342171    +      TCPopen 220 EHLO 250 MAIL 250 RCPT 250 RCPT 250 DATA 354 . 250
                                QUIT 221 TCPclose
87    443    0.350073    -      TCPopen 220 EHLO 500 HELO 250 MAIL 250 RCPT 250 DATA 354 . TCPtimeout
88    560    0.44253     +      TCPopen 220 EHLO 500 HELO 250 MAIL 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 DATA 354 . 250 QUIT 221 TCPclose
89    1494   1.18061     +      TCPopen 220 EHLO 500 HELO 250 MAIL 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT
                                250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 RCPT 250 DATA
                                354 . 250 QUIT 221 TCPclose
90    1657   1.30942     -      TCPopen 220 mail 250 rcpt 250 354 250 221 TCPclose
91    1933   1.52752     +      TCPopen 220 EHLO 500 HELO 250 MAIL 250 RCPT 250 RCPT 250 RCPT
                                250 DATA 354 . 250 QUIT 221 TCPclose
92    1996   1.5773      -      TCPopen 220 HELO 250 MAIL 250 RCPT 250 DATA 354 . TCPtimeout
93    2814   2.22371     -      TCPopen 220 250(2) 354 250 221 TCPclose
94    3117   2.46316     +      TCPopen 220 HELO 250 MAIL 250 RCPT 250 DATA 354 . 250 QUIT 221
                                TCPclose
95    3310   2.61567     -      TCPopen 220 TCPreset
96    3887   3.07163     +      TCPopen 220 EHLO 250 MAIL 250 RCPT 250 DATA 354 . 250 QUIT 221
                                TCPclose
97    7389   5.83903     +      TCPopen 220 EHLO 500 HELO 250 MAIL 250 RCPT 250 RCPT 250 DATA
                                354 . 250 QUIT 221 TCPclose
98    14633  11.5635     +      TCPopen 220 EHLO 500 HELO 250 MAIL 250 RCPT 250 DATA 354 . 250
                                QUIT 250 TCPclose
99    79889  63.1309     +      TCPopen 220 EHLO 500 HELO 250 MAIL 250 RCPT 250 DATA 354 . 250
                                QUIT 221 TCPclose

99 Start with: TCPopen
53 End with: TCPclose (119,791 samples)
11 End with: TCPreset (3,957 samples)
27 End with: TCPtimeout (2,519 samples)
8 End with: NIDSexit (278 samples)
36 Positive sample Types. 114,869 Positive samples.
63 Negative sample Types. 11,676 Negative samples.
126,545 Total Samples
99 Unique Types
Table A.10: POP3 TCP Connection Summary - Note that the weekly trace files do
not include the complete TCP connection for all POP3 sessions. The Total column
does NOT sum the row.


TCP Operation  Week 1  Week 2  Week 3  Week 4  Week 5  Total   Percent
TCPopen        255     238     395     363     9,709   10,960  100
TCPclose       254     236     395     363     1,520   2,768   25.25
TCPreset       0       1       0       0       8,089   8,090   73.81
TCPtimeout     0       0       0       0       6       8       0.07
NIDSexit       1       1       0       0       94      94      0.85

Total termination conditions: 10,960

A.6 POP3 Sample Data


Table A.12: Data Summary: POP3 Command Alphabet total.pcap

Type  Count  Percent     Label  Sample
1     1      0.00912409  +      TCPopen USER PASS STAT RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE
                                RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE RETR DELE QUIT TCPclose
2     1      0.00912409  -      TCPopen QUIT TCPtimeout
3     1      0.00912409  -      TCPopen USER PASS STAT RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE QUIT TCPtimeout
4     1      0.00912409  -      TCPopen USER PASS STAT RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE
                                RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE RETR DELE QUIT TCPtimeout
5     2      0.0182482   +      TCPopen USER PASS STAT RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE
                                RETR DELE RETR DELE QUIT TCPclose
6     2      0.0182482   +      TCPopen USER PASS STAT RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE
                                RETR DELE RETR DELE RETR DELE QUIT TCPclose
7     2      0.0182482   +      TCPopen USER PASS STAT RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE
                                RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE QUIT
                                TCPclose
8     2      0.0182482   +      TCPopen USER PASS STAT RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE
                                RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE QUIT TCPclose
9     2      0.0182482   +      TCPopen USER PASS STAT RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE
                                RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE RETR DELE RETR DELE QUIT TCPclose
10    2      0.0182482   +      TCPopen USER PASS STAT RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE
                                RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE
                                RETR DELE RETR DELE QUIT TCPclose
11    2      0.0182482   +      TCPopen USER PASS STAT RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE
                                RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE
                                RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE QUIT
                                TCPclose
12    2      0.0182482   -      TCPopen QUIT TCPclose
13    2      0.0182482   -      TCPopen USER TCPreset
14    5      0.0456204   +      TCPopen USER PASS STAT RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE
                                QUIT TCPclose
15    5      0.0456204   -      TCPopen TCPtimeout
16    9      0.0821168   +      TCPopen USER PASS STAT RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE RETR DELE RETR DELE QUIT TCPclose
17    25     0.228102    +      TCPopen USER PASS STAT RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE RETR DELE QUIT TCPclose
18    32     0.291971    +      TCPopen USER PASS STAT RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE QUIT TCPclose
19    47     0.428832    +      TCPopen USER PASS STAT RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE QUIT TCPclose
20    69     0.629562    +      TCPopen USER PASS STAT RETR DELE RETR DELE RETR DELE RETR
                                DELE QUIT TCPclose
21    94     0.857664    -      TCPopen NIDSexit
22    136    1.24088     +      TCPopen USER PASS STAT RETR DELE RETR DELE RETR DELE QUIT
                                TCPclose
23    197    1.79745     +      TCPopen USER PASS STAT RETR DELE RETR DELE QUIT TCPclose
24    342    3.12044     +      TCPopen USER PASS STAT RETR DELE QUIT TCPclose
25    631    5.7573      +      TCPopen USER PASS STAT QUIT TCPclose
26    1258   11.4781     -      TCPopen TCPclose
27    8088   73.7956     -      TCPopen TCPreset

27 Start with: TCPopen
20 End with: TCPclose (2,768 samples)
2 End with: TCPreset (8,090 samples)
4 End with: TCPtimeout (8 samples)
1 End with: NIDSexit (94 samples)
18 Positive sample Types. 1,508 Positive samples.
9 Negative sample Types. 9,452 Negative samples.
10,960 Total Samples
27 Unique Types
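
The Percent column and the "End with" lines of these summaries can be recomputed from the Count and Sample columns. The following is a minimal Python sketch, not the author's tooling: the names ROWS, percent, and tally_by_terminator are hypothetical, only four of the 27 rows of Table A.12 are included, and the six-significant-digit rounding is an assumption inferred from the printed values.

```python
from collections import Counter

# (count, label, sample) rows copied from Table A.12; only four of the
# 27 types are included here, so the tallies below are partial.
ROWS = [
    (631, "+", "TCPopen USER PASS STAT QUIT TCPclose"),
    (94, "-", "TCPopen NIDSexit"),
    (1258, "-", "TCPopen TCPclose"),
    (8088, "-", "TCPopen TCPreset"),
]
TOTAL_SAMPLES = 10960  # the "Total Samples" line of the table


def percent(count, total=TOTAL_SAMPLES):
    """Share of all samples, rounded to six significant digits (assumed)."""
    return float(f"{100.0 * count / total:.6g}")


def tally_by_terminator(rows):
    """Sum sample counts per final symbol (TCPclose, TCPreset, ...)."""
    totals = Counter()
    for count, _label, sample in rows:
        totals[sample.split()[-1]] += count
    return totals


if __name__ == "__main__":
    for count, label, sample in ROWS:
        print(count, percent(count), label, sample)
    print(dict(tally_by_terminator(ROWS)))
```

Run over all 27 rows, the same tally would be expected to reproduce the "End with" sample totals listed above.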

Table A.13: Data Summary: POP3 Reply Alphabet total.pcap

Type  Count  Percent     Label  Sample
1     1      0.00912409  +      TCPopen OK(41) TCPclose
2     1      0.00912409  -      TCPopen OK(2) TCPclose
3     1      0.00912409  -      TCPopen OK(2) TCPtimeout
4     2      0.0182482   +      TCPopen OK(27) TCPclose
5     2      0.0182482   +      TCPopen OK(29) TCPclose
6     2      0.0182482   +      TCPopen OK(33) TCPclose
7     2      0.0182482   +      TCPopen OK(37) TCPclose
8     2      0.0182482   +      TCPopen OK(43) TCPclose
9     2      0.0182482   +      TCPopen OK(49) TCPclose
10    2      0.0182482   +      TCPopen OK(55) TCPclose
11    2      0.0182482   -      TCPopen OK(15) TCPtimeout
12    5      0.0456204   +      TCPopen OK(23) TCPclose
13    5      0.0456204   -      TCPopen TCPtimeout
14    9      0.0821168   +      TCPopen OK(21) TCPclose
15    15     0.136861    -      TCPopen TCPreset
16    25     0.228102    +      TCPopen OK(19) TCPclose
17    32     0.291971    +      TCPopen OK(17) TCPclose
18    47     0.428832    +      TCPopen OK(15) TCPclose
19    60     0.547445    -      TCPopen OK(2) ERR TCPclose
20    69     0.629562    +      TCPopen OK(13) TCPclose
21    94     0.857664    -      TCPopen NIDSexit
22    136    1.24088     +      TCPopen OK(11) TCPclose
23    197    1.79745     +      TCPopen OK(9) TCPclose
24    342    3.12044     +      TCPopen OK(7) TCPclose
25    631    5.7573      +      TCPopen OK(5) TCPclose
26    1199   10.9398     -      TCPopen TCPclose
27    8075   73.677      -      TCPopen OK TCPreset

27 Start with: TCPopen
21 End with: TCPclose (2,768 samples)
2 End with: TCPreset (8,090 samples)
3 End with: TCPtimeout (8 samples)
1 End with: NIDSexit (94 samples)
18 Positive sample Types. 1,508 Positive samples.
9 Negative sample Types. 9,452 Negative samples.
10,960 Total Samples
27 Unique Types

Table A.14: Data Summary: POP3 Composite Alphabet total.pcap

Type  Count  Percent     Label  Sample
1     1      0.00912409  +      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK QUIT OK TCPclose
2     1      0.00912409  -      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                QUIT OK TCPtimeout
3     1      0.00912409  -      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR
                                OK DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE
                                OK RETR OK DELE RETR DELE RETR DELE RETR DELE RETR DELE
                                RETR DELE RETR DELE RETR DELE RETR DELE RETR DELE RETR
                                DELE RETR DELE RETR DELE QUIT TCPtimeout
4     1      0.00912409  -      TCPopen QUIT OK(2) TCPclose
5     1      0.00912409  -      TCPopen QUIT OK(2) TCPtimeout
6     1      0.00912409  -      TCPopen QUIT TCPclose
7     2      0.0182482   +      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK QUIT OK TCPclose
8     2      0.0182482   +      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                QUIT OK TCPclose
9     2      0.0182482   +      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK QUIT OK TCPclose
10    2      0.0182482   +      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK
                                DELE OK QUIT OK TCPclose
11    2      0.0182482   +      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                QUIT OK TCPclose
12    2      0.0182482   +      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK QUIT OK
                                TCPclose
13    2      0.0182482   +      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK QUIT OK TCPclose
14    2      0.0182482   -      TCPopen OK USER TCPreset
15    5      0.0456204   +      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK
                                DELE OK QUIT OK TCPclose
16    5      0.0456204   -      TCPopen TCPtimeout
17    9      0.0821168   +      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK QUIT OK
                                TCPclose
18    15     0.136861    -      TCPopen TCPreset
19    25     0.228102    +      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK RETR OK DELE OK QUIT OK TCPclose
20    32     0.291971    +      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                RETR OK DELE OK QUIT OK TCPclose
21    47     0.428832    +      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK RETR OK DELE OK
                                QUIT OK TCPclose
22    60     0.547445    -      TCPopen OK(2) ERR TCPclose
23    69     0.629562    +      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK RETR OK DELE OK QUIT OK TCPclose
24    94     0.857664    -      TCPopen NIDSexit
25    136    1.24088     +      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR OK
                                DELE OK RETR OK DELE OK QUIT OK TCPclose
26    197    1.79745     +      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK RETR OK
                                DELE OK QUIT OK TCPclose
27    342    3.12044     +      TCPopen OK USER OK PASS OK STAT OK RETR OK DELE OK QUIT OK
                                TCPclose
28    631    5.7573      +      TCPopen OK USER OK PASS OK STAT OK QUIT OK TCPclose
29    1198   10.9307     -      TCPopen TCPclose
30    8073   73.6588     -      TCPopen OK TCPreset

30 Start with: TCPopen
22 End with: TCPclose (2,768 samples)
3 End with: TCPreset (8,090 samples)
4 End with: TCPtimeout (8 samples)
1 End with: NIDSexit (94 samples)
18 Positive sample Types. 1,508 Positive samples.
12 Negative sample Types. 9,452 Negative samples.
10,960 Total Samples
30 Unique Types
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Appendix  B.  Low  Level  Implementation 

This  appendix  introduces  supporting  toolkits  we  reviewed,  as  well  as  the  environment 
and  tools  used  in  the  development  process.  Finally,  we  briefly  introduce  a  low  level 
look  at  flowtool  and  flowinfer. 

B.1 Sources for Protocol Formats

There  are  rich  repositories  we  can  mine  for  protocol  format  information.  Several 
open  source  projects  for  trace  collection  and  intrusion  detection  provide  source  code. 

Wireshark contains a large body of protocol formats (approximately 700 at this
time) that have been gleaned from open specifications, source code, and community
reverse engineering efforts [48]. jNetStream specifies network protocol formats in a
description language called Network Protocol Language (NPL) [19]. The NPL
specifications are compiled to Java class files by the included nplc compiler [19]. Netdude
supports protocol formats that are hardcoded in C source and PHDL, a packet header
description language [133]. Both Wireshark and jNetStream are protocol analyzers,
while Netdude concentrates on packet trace manipulation and presentation over
analysis.

Bro is a research-oriented Intrusion Detection System (IDS) [140]. Bro provides the
protocol description language binpac, a binpac language compiler, and protocol
descriptions with its source code [140,192]. The binpac protocol descriptions cover a
range of ASCII text and binary protocols. Snort is another popular open source IDS
that embodies hand-coded application layer protocol parsers [98].

There are also several data description languages that are potentially useful for
describing ad hoc mixed binary and text data such as protocol formats. Fisher et
al. propose automated inference of ad hoc data to generate PADS data description
language [81]. PacketTypes is another data description language specifically designed
for specifying network protocol messages [164]. DataScript is a specification language
for binary data formats [13]. Other options include direct use of Augmented BNF or
Abstract Syntax Notation number One (ASN.1).
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B.2  Automata  Toolkits 


There are several frameworks that contain foundational language theoretic
elements but do not include GI algorithms.

B.2.1 AMoRE. AMoRE [161] is used by Berg [23, Section 4] as the basis for
both a direct implementation and a domain-optimized version of Angluin's learner.
AMoRE is implemented as a library in portable C. While the AMoRE source
code is available, its build system was not compatible with the build system on our
development system. For testing purposes we ported the build system to autoconf¹.
AMoRE was also used by the MERLin Project [267].

B.2.2 Vaucanson. Vaucanson² is a C++ based automata library maintained
by the EPITA Research and Development Laboratory (the same research group also
maintains the Mical GI package) [86]. The version available to the public at the time of
writing is 1.1.1. Lombardy [152] explains Vaucanson in detail and provides a comparison
to the AMoRE library. An early version of Vaucanson provides the automata engine
for the Mical GI package [213].

B.2.3 JFLAP. JFLAP is an automata learning tool implemented in Java
that supports many of the core operations required for grammatical inference [218,
219]. Source code for version 4 is available for download, but source code for the newer
versions 5 and 6 must be requested from the author [219]. JFLAP 4 is used in the
kBehavior algorithm implementation [158,198].

B.2.4 Grail. Grail [276] and its more recent incarnation as Grail+ [285]
provide a range of useful automata operations. Unfortunately, the C++ source code
is rather dated and the build system does not support our development platform. A
modified version of Grail version 2.5 was created by the MERLin Project to study
generalized NFA [267]. The modified version provides build targets for Linux but not
for contemporary versions of GCC.

1 Autoconf - http://www.gnu.org/software/autoconf/
2 Vaucanson Project - http://www.lrde.epita.fr/cgi-bin/twiki/view/Projects/Vaucanson

B.3  Grammatical  Inference  Implementations 

While there is no single definitive collection of GI algorithms, source code for
several of the algorithms discussed above is available. Also, the LearnLib and Mical
frameworks each provide a subset of the GI algorithms.

B.3.1 LearnLib. LearnLib³ is a C++ library [211] that provides an
architecture for GI algorithm testing. Because the library is designed using Common Object
Request Broker Architecture (CORBA)⁴ interfaces, it can be used from any language
with CORBA support. At this time it only implements Angluin's L* and a
modified L*. LearnLib is incorporated into the Formal Methods for Industrial Critical
Systems-Java Electronic Tool Integration (FMICS-jETI)⁵ platform to support model
design recovery with Smyle [157]. While the CORBA interface is currently exposed
on the Internet and sample uses are available in example source code, the source code
for the library does not appear to be available to the public.

B.3.2 Mical. Mical is a C++ library of grammatical inference algorithms
including k-TSSI, k-RI, MGGI, and RPNI [213]. The library extensively utilizes C++
template mechanisms that are compatible with GCC⁶ version 3.2 and version 3.3 but
not newer versions of the GCC toolchain (currently version 4.2.2). The only publicly
released version of the software is version 0.1 from 10 July 2003 [213]. The library
also implements a range of functions internally including: creation of MCA, creation
of PTA, and conversion of MCA to PTA [213]. We attempted, unsuccessfully, to
build the library with relaxed C++ dialect options⁷ available in GCC version 4.2.2.

3 LearnLib - http://faelis.cs.uni-dortmund.de/
4 CORBA FAQ - http://www.omg.org/gettingstarted/corbafaq.htm
5 FMICS-jETI - http://jeti.cs.uni-dortmund.de/fmics/
6 GCC, the GNU Compiler Collection - http://gcc.gnu.org
7 -fpermissive, etc., see [85, Section 3.5]
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We  were  able  to  build  and  evaluate  the  library  by  creating  a  parallel  installation 
of  GCC  version  3.3.6.  Mical  uses  a  modified  version  of  the  Vaucanson  automata 
library  included  with  its  source  code  and  is  not  compatible  with  current  versions  of 
Vaucanson  [86]. 
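
To illustrate the prefix tree acceptor (PTA) mentioned above, here is a minimal Python sketch. It is a generic textbook construction, not Mical's or Vaucanson's implementation; the function names build_pta and pta_accepts and the dictionary-based representation are assumptions made for illustration. Each symbol is a whole token, which mirrors the multi-letter alphabets used in the sample data of Appendix A.

```python
def build_pta(samples):
    """Build a prefix tree acceptor from positive samples.

    Each sample is a sequence of symbols (whole tokens such as "USER").
    States are integers with 0 as the initial state; the transition
    table maps (state, symbol) to the next state.
    """
    transitions = {}
    accepting = set()
    next_state = 1
    for sample in samples:
        state = 0
        for symbol in sample:
            key = (state, symbol)
            if key not in transitions:
                # First time this prefix is seen: allocate a new state.
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)  # the state reached by a full sample accepts
    return transitions, accepting


def pta_accepts(pta, sample):
    """Return True only for strings that were in the training set."""
    transitions, accepting = pta
    state = 0
    for symbol in sample:
        key = (state, symbol)
        if key not in transitions:
            return False
        state = transitions[key]
    return state in accepting


# Two positive POP3 command samples taken from Table A.12.
POSITIVE = [
    "TCPopen USER PASS STAT QUIT TCPclose".split(),
    "TCPopen USER PASS STAT RETR DELE QUIT TCPclose".split(),
]
PTA = build_pta(POSITIVE)
```

A PTA accepts exactly its training samples and nothing else; state-merging learners such as RPNI start from this structure and generalize by merging states.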

B.3.3 Other Implementations. Implementations of several other algorithms are
publicly available. Ammons provides C source code for k-tails and sk-strings for PFSA
inference [6]. C source code for Alergia, a stochastic automata inference algorithm [94],
RPNI, and k-tails is available from [225]. Mariani and Pezze provide source for
kBehavior and kTail⁸. Their algorithms are implemented in Java [158,198] and use
JFLAP version 4 for automata support [219].

B.4  Implementation  Language 

The choice of implementation language was driven by the selection of support
packages and our familiarity with C and C++. The choice of development
environment was driven by the implementation language. We primarily used Eclipse⁹ version
3.2, KDevelop¹⁰ version 3.5.0, and various text editors on an openSUSE¹¹ version 10.3
Linux computer. We used Concurrent Versions System (CVS) for revision control.


B.5  Low  Level  Implementation 

Our proof-of-concept low level implementation comprises two programs,
flowtool and flowinfer, and associated driver scripts. Flowtool processes raw pcap trace
files and produces alphabet and sample string files for input into flowinfer.


B.5.1 flowtool. Flowtool is a C language program using GLib version 2.14¹²
standard data structures for dynamic storage and libnids for TCP connection
reassembly and defragmentation. An elided call graph from main of flowtool is shown
in Figure B.2.

8 k-tails is also covered by [99]
9 Eclipse - http://www.eclipse.org
10 KDevelop - http://www.kdevelop.org/
11 openSUSE - http://www.opensuse.org
12 GLib Reference Manual - http://library.gnome.org/devel/glib/2.14/

Figure B.1: Flowtool verbose sample output.

# Sample Type-6
# Protocol: 25 Server 172.16.114.50 Client 152.204.242.193.1941
# Key = 172.16.114.50:25,152.204.242.193:1941: 923693227.228408
# Start: 923693227.228408
# End: 923693238.320118
# Wireshark filter: (ip.addr eq 172.16.114.50 and ip.addr eq 152.204.242.193) and (tcp.port eq 25 and tcp.port eq 1941)
- TCPopen 220 250 250 250 503 HELP 500 221 TCPclose

# Sample Type-32
# Protocol: 25 Server 172.16.114.50 Client 202.49.244.10.1027
# Key = 172.16.114.50:25,202.49.244.10:1027: 922715292.597701
# Start: 922715292.597701
# End: 922715294.666466
# Wireshark filter: (ip.addr eq 172.16.114.50 and ip.addr eq 202.49.244.10) and (tcp.port eq 25 and tcp.port eq 1027)
- TCPopen 220 250 250 503 HELP 500 221 TCPclose

# Sample Type-32
# Protocol: 25 Server 172.16.114.50 Client 202.49.244.10.1027
# Key = 172.16.114.50:25,202.49.244.10:1027: 922715290.284924
# Start: 922715290.284924
# End: 922715292.352719
# Wireshark filter: (ip.addr eq 172.16.114.50 and ip.addr eq 202.49.244.10) and (tcp.port eq 25 and tcp.port eq 1027)
- TCPopen 220 250 250 503 HELP 500 221 TCPclose


As the libnids library creates the connection level flow, the application session
samples are incrementally created by parsing the application data flow. As previously
mentioned, because the code is a proof of concept we chose to implement hand-coded
operator parsers derived from specification documents and heuristics from Wireshark
and Bro. Once the application level traffic is parsed we write out the alphabet of
the protocol under consideration and the sample strings of operators, replies, and the
composite of operators intermixed with replies. The different sample types are written
to separate flat text files. Flowtool verbose sample output is shown in Figure B.1.
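
The three sample views described above (operators, replies, and their composite) can be sketched in Python as follows. This is an illustrative sketch, not flowtool's C implementation: the event representation, the function names sample_strings and collapse_runs, and the view names are hypothetical. Collapsing a run of identical symbols into the form SYMBOL(n) is an assumption inferred from entries such as OK(5) in Table A.13; the verbose output in Figure B.1 does not collapse runs.

```python
from itertools import groupby


def collapse_runs(symbols):
    """Render runs of one repeated symbol as SYMBOL(n), e.g. OK(5).

    Assumption: this mirrors the notation seen in the Appendix A tables.
    """
    collapsed = []
    for symbol, run in groupby(symbols):
        length = len(list(run))
        collapsed.append(symbol if length == 1 else f"{symbol}({length})")
    return collapsed


def sample_strings(events, terminator="TCPclose"):
    """Build command, reply, and composite sample strings for one session.

    events is a list of (sender, symbol) pairs in arrival order, where
    sender is "client" for operators and "server" for replies.
    """
    views = {
        "command": [sym for who, sym in events if who == "client"],
        "reply": [sym for who, sym in events if who == "server"],
        "composite": [sym for _who, sym in events],
    }
    return {
        name: " ".join(["TCPopen", *collapse_runs(symbols), terminator])
        for name, symbols in views.items()
    }


# A short POP3 session: greeting, USER, PASS, STAT, QUIT, each acknowledged.
SESSION = [
    ("server", "OK"),
    ("client", "USER"), ("server", "OK"),
    ("client", "PASS"), ("server", "OK"),
    ("client", "STAT"), ("server", "OK"),
    ("client", "QUIT"), ("server", "OK"),
]
STRINGS = sample_strings(SESSION)
```

For this session the three strings correspond to the 631-sample rows of Tables A.12, A.13, and A.14 respectively.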


Flowtool  was  developed  using  the  KDevelop  IDE  and  autoconf  build  system. 
The  C  source  code  supports  the  Doxygen  documentation  system. 


B.5.2  flowinfer.  We  developed  the  C++  language  program  called  howinfcr 
for  automata  inference.  We  chose  the  mical  GI  toolkit  as  the  basis  of  flowinfer.  Our 
choice  of  mical  was  driven  by  the  fact  that  it  was  the  only  GI  toolkit  that  supported 
multi-letter  alphabets.  Additionally,  mical  was  the  only  toolkit  with  an  integrated 
automata  toolkit.  The  program  processes  the  output  hat  text  hies  from  howinfcr  using 
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TCP  Connection  Event 


Figure B.2: Flowtool call graph (elided) - libnids creates a
packet processing event loop that dispatches events to registered
callback functions. The callback function named detect_edge is
registered for TCP connection events. The detect_edge callback
determines if the connection is on a port of interest and calls
the appropriate protocol data format parser in either do_smtp
or do_pop3.
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mical's k-RI and k-TSSI algorithm implementations to generate Vaucanson automata.
The automata are output as GraphViz¹³ dot files and post-processed to graphics
by a driver script called flowtool_data. The driver script builds the command line
parameters and invokes flowinfer for the selected algorithms for k values 1 to 5. A
summary of the automata states, edges, initial states, and terminal/final states is
generated from the output of flowinfer for analysis purposes. Low-level data structures
in mical and the integrated version of Vaucanson are implemented using the C++
Standard Template Library.
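
To make the inference and summary steps concrete, the following Python sketch builds a k-testable (strict sense) automaton directly from positive samples, prints a states/edges summary for k values 1 to 5, and emits a GraphViz dot description. It is an illustration under stated assumptions, not the mical or flowinfer implementation: the names infer_ktss, accepts, and to_dot are hypothetical, the POP3-like sample strings are made up, and the k indexing follows the textbook definition (a state is the window of the last k-1 symbols read), which may differ from mical's parameterization. The k-RI algorithm is not shown.

```python
from typing import Dict, List, Set, Tuple

State = Tuple[str, ...]


def infer_ktss(samples: List[List[str]], k: int):
    """Build a k-testable (strict sense) automaton from positive samples.

    A state is the window of the last k-1 symbols read (shorter near the start).
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    start: State = ()
    states: Set[State] = {start}
    finals: Set[State] = set()
    delta: Dict[Tuple[State, str], State] = {}
    for sample in samples:
        state = start
        for symbol in sample:
            window = state + (symbol,)
            # Keep only the last k-1 symbols (none at all when k == 1).
            nxt = window[max(0, len(window) - (k - 1)):]
            delta[(state, symbol)] = nxt
            states.add(nxt)
            state = nxt
        finals.add(state)  # the window reached at end of string is accepting
    return states, start, finals, delta


def accepts(automaton, string: List[str]) -> bool:
    """Run the deterministic automaton over a string of tokens."""
    _, state, finals, delta = automaton
    for symbol in string:
        if (state, symbol) not in delta:
            return False
        state = delta[(state, symbol)]
    return state in finals


def to_dot(automaton, name: str = "hypothesis") -> str:
    """Render the automaton as GraphViz dot text."""
    states, start, finals, delta = automaton
    ids = {s: i for i, s in enumerate(sorted(states))}
    lines = [f"digraph {name} {{", "  rankdir=LR;", "  init [shape=point];"]
    for s, i in ids.items():
        shape = "doublecircle" if s in finals else "circle"
        lines.append(f"  q{i} [shape={shape}];")
    lines.append(f"  init -> q{ids[start]};")
    for (src, symbol), dst in sorted(delta.items()):
        lines.append(f'  q{ids[src]} -> q{ids[dst]} [label="{symbol}"];')
    lines.append("}")
    return "\n".join(lines)


# Made-up POP3-like composite samples (not taken from the IDEVAL data set).
samples = [
    ["TCPopen", "+OK", "USER", "+OK", "PASS", "+OK", "QUIT", "+OK", "TCPclose"],
    ["TCPopen", "+OK", "USER", "+OK", "PASS", "+OK", "STAT", "+OK",
     "QUIT", "+OK", "TCPclose"],
]

for k in range(1, 6):
    states, start, finals, delta = infer_ktss(samples, k)
    print(f"k={k}: {len(states)} states, {len(delta)} edges, "
          f"#I = 1, #T = {len(finals)}")

print(to_dot(infer_ktss(samples, 2), name="kTSS_example"))
```

In this sketch, small k values merge many contexts into one state and so accept strings that were never observed, while larger k values track more context and accept fewer unseen strings. The dot text can be rendered with the GraphViz dot tool.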

Flowinfer  was  developed  using  the  KDevelop  IDE  and  autoconf  build  system. 
The  C++  source  code  supports  the  Doxygen  documentation  system. 


¹³ GraphViz - http://www.graphviz.org
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Appendix  C.  Inferred  Automaton 

This appendix presents the automata inferred by the k-RI and k-TSSI algorithms
for the subset of the POP3 protocol exercised by the IDEVAL data set. Inferred
SMTP automata are not presented here because the graphs were not legible when
scaled to fit on a single page.
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Figure C.1: k-TSSI POP3 Composite Final Automaton k = 1.
k-TSSI with k = 1 over-restricts the target automaton. Session
initiation and session termination are not inferred as separate
states.




kRI_POP3_Composite_final_ { 15 states, 15 edges, #I = 1, #T = 1 }

Figure C.2: k-RI POP3 Composite Final Automaton k = 1.
k-RI inference exactly matches the target automaton for the
subset of the POP3 specification exercised by the IDEVAL data set.
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Figure C.3: k-RI POP3 Composite Final Automaton k = 2.
k-RI inference over-generalizes the target automaton.
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


kTSSI_POP3_Composite_final_ { 16 states, 17 edges, #I = 1, #T = 1 }

Figure C.4: k-TSSI POP3 Composite Final Automaton k = 2.
k-TSSI with k = 2 exactly matches the k-RI inference with
k = 2. The hypothesis automaton over-generalizes the target
automaton.
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Figure C.5: k-RI POP3 Composite Final Automaton k = 3.
k-RI inference continues to over-generalize the target automaton.
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


Figure C.6: k-TSSI POP3 Composite Final Automaton k = 3.
k-TSSI with k = 3 exactly matches the k-RI inference with
k = 3. It also over-generalizes the target automaton.
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